
STAT 201A - Introduction to Probability at an advanced level
All Lectures
Fall 2022, UC Berkeley

Aditya Guntuboyina

December 2, 2022

Contents

1 Lecture One
1.1 What is Probability Theory?
1.2 How does Probability Theory work?
1.3 Rules of Probability
1.4 Example 1: Testing and Covid
1.5 Example 2: Spots on a patient

2 Lecture Two
2.1 Example 3: Prisoner's dilemma
2.2 Example 4: Monty Hall Problems
2.3 Example 5: MacKay sequence example
2.4 Interpretation of Probability
2.4.1 Frequentist/Objective Understanding of Probability

3 Lecture Three
3.1 Interpretation of Probability
3.1.1 Frequentist/Objective Understanding of Probability
3.1.2 Subjective or Bayesian Understanding of Probability
3.1.3 Rules of Probability
3.1.4 Product Rule
3.1.5 Sum Rule

4 Lecture Four
4.1 Recap: Derivation of the Rules of Probability for Subjective Probability
4.2 Probability Assignment
4.3 Urn Problems: Hypergeometric Distribution

5 Lecture Five
5.1 The Hypergeometric Distribution
5.1.1 Mean and Variance of the Hypergeometric Distribution
5.2 Inverse Problem
5.2.1 Case 1: N is known and R is unknown
5.2.2 Case 2: N is unknown and R is known
5.2.3 Case 3: Both N and R are unknown

6 Lecture Six
6.1 Random Variables
6.1.1 Independence of Random Variables
6.2 Common Discrete Distributions
6.2.1 Bernoulli Ber(p) Distribution
6.2.2 Binomial Bin(n, p) Distribution
6.2.3 Negative Binomial NB(n, p) distribution

7 Lecture Seven
7.1 Geometric Distribution
7.2 Poisson Distribution
7.3 Continuous Random Variables
7.4 Uniform Distribution
7.5 The Gaussian or Normal Distribution
7.5.1 The Gauss Derivation of the Normal Distribution

8 Lecture Eight
8.1 The Normal Distribution as an Approximation to the Binomial Distribution
8.1.1 Stirling Approximation
8.1.2 Entropy Approximation of Bin(n, p)
8.1.3 Normal Approximation of Bin(n, p)
8.1.4 Implication for the chi-squared test

9 Lecture Nine
9.1 Normal Approximation for the Binomial: CLT
9.1.1 De Moivre-Laplace Central Limit Theorem
9.2 The Exponential Distribution
9.3 The Gamma Distribution
9.4 Variable Transformations

10 Lecture Ten
10.1 Variable Transformations
10.2 The Cumulative Distribution Function and the Quantile Transform

11 Lecture Eleven
11.1 Joint Densities
11.2 Joint Densities under General Linear Invertible transformations
11.2.1 Linear Transformations
11.2.2 Invertible Linear Transformations

12 Lecture Twelve
12.1 Last Class: Joint Density
12.2 Marginal Densities corresponding to a Joint Density
12.3 Independence in terms of Joint Densities
12.4 How linear transformations change joint densities
12.5 General Invertible Transformations
12.6 The Herschel-Maxwell Derivation of the Normal Distribution

13 Lecture Thirteen
13.1 Joint Density under Transformations
13.2 Conditional Densities for Continuous Random Variables
13.2.1 Conditional Density is Proportional to Joint Density
13.2.2 Conditional Densities and Independence
13.2.3 Law of Total Probability for Continuous Random Variables
13.2.4 Bayes Rule for Continuous Random Variables

14 Lecture Fourteen
14.1 Recap: Last Class
14.2 Law of Total Probability (LTP) and Bayes Rule for Continuous Variables
14.3 LTP and Bayes Rule for general random variables
14.3.1 X and Θ are both discrete
14.3.2 X and Θ are both continuous
14.3.3 X is discrete while Θ is continuous
14.3.4 X is continuous while Θ is discrete

15 Lecture Fifteen
15.1 LTP and Bayes Rule for general random variables
15.1.1 X and Θ are both discrete
15.1.2 X and Θ are both continuous
15.1.3 X is discrete while Θ is continuous
15.1.4 X is continuous while Θ is discrete
15.2 A Simple Model Selection Application
15.3 Model Selection with unknown parameters
15.3.1 Considering one more model

16 Lecture Sixteen
16.1 Conditional Expectation
16.1.1 Law of Iterated/Total Expectation
16.1.2 Application of the Law of Total Expectation to Statistical Risk Minimization
16.2 Conditional Variance

17 Lecture Seventeen
17.1 Univariate normal and t densities
17.2 Random Vectors and Covariance Matrices
17.3 Multivariate Normal and t-densities
17.4 Bayesian Linear Regression

18 Lecture Eighteen
18.1 Recap: Multivariate Normal and t Distributions
18.2 Application to Linear Regression
18.3 Multiple Linear Regression
18.4 Models with Nonlinear Parameter Dependence

19 Lecture Nineteen
19.1 Last Class: Linear Regression
19.2 Nonlinear Regression Models
19.3 More on Nonlinear Regression Models

20 Lecture Twenty
20.1 Last Class: Nonlinear Regression Models with both linear and nonlinear parameter dependence
20.2 Logistic Regression

21 Lecture Twenty One
21.1 Logistic Regression
21.2 Details behind the Newton Algorithm for computing the MLE

22 Lecture Twenty Two
22.1 Linear Regression Recap
22.2 Linear Regression with Gaussian prior
22.3 Linear Regression on an Earnings Dataset
22.4 Choosing the tuning parameter τ
22.5 Additional Comments and References

23 Lecture Twenty Three
23.1 Comments on the Coefficient Interpretation in Last Class's Regression Model
23.2 Comments on Regularization

24 Lecture Twenty Four
24.1 Central Limit Theorem (CLT)
24.2 CLT Proof strategy
24.3 Transforms
24.3.1 z-Transform (Probability Generating Function)
24.3.2 Laplace Transform (Moment Generating Function)
24.3.3 Fourier Transform (Characteristic Function)

25 Lecture Twenty Five
25.1 Recap: Last Class
25.2 CLT proof via the Fourier Transform
25.3 Closing Thoughts

1 Lecture One

1.1 What is Probability Theory?

Probability theory is what one should use when reasoning in the presence of uncertainty.

1.2 How does Probability Theory work?

Suppose we are interested in knowing whether a certain proposition is true. Suppose that we
do not have access to full information that would allow us to conclusively determine whether
the proposition is true or not. Probability theory allows us to determine a number between
0 and 1 representing how likely it is that the proposition is true based on the available
information. This is achieved by the following two steps:

1. Step One: The available information that we either possess or that we assume for the
sake of argument is converted into numerical assignments for the probabilities of
certain basic or elementary propositions. This step is often referred to as the modeling
step.

2. Step Two: Based on the probability model, we calculate probabilities of the proposi-
tions of interest using the rules of probability.

1.3 Rules of Probability

Probabilities are assigned to propositions (also known as events). Every probability is con-
ditional on some information (this could be available information or some information that
we assume for the sake of argument). We shall denote the probability of a proposition A
conditioned on some information I by P(A | I). When the information I is clear from con-
text, we sometimes omit it and write the probability P(A | I) as simply P(A). Even when
we do this, it should always be kept in mind that probabilities are always conditioned on
some information.

1. The probability of a proposition always lies between 0 and 1. The probability of an impossible proposition is 0 and the probability of a certain proposition is 1.

2. Product Rule: P(A ∩ B | I) = P(A | I)P(B | A, I) = P(B | I)P(A | B, I). Here A ∩ B is the proposition: "both A and B are true". Also P(A | B, I) is the probability of A conditioned on the truth of the proposition B as well as the information I. A direct consequence of the product rule is:

   P(B | A, I) = P(A | B, I) P(B | I) / P(A | I).

   The above formula is known as the Bayes rule.

3. Sum Rule: P(A ∪ B | I) = P(A | I) + P(B | I) for disjoint propositions A and B. Here A ∪ B denotes the proposition: "at least one of A and B is true".

We shall see some justification for these rules later.

1.4 Example 1: Testing and Covid

Problem 1.1. Suppose I just tested positive for Covid. Do I really have Covid?

This is a situation involving uncertainty mainly because the test may not be 100% accurate.
In other words, my result could be a false positive. I need to calculate

P{I have Covid | I tested positive + other background information}

which we abbreviate as P(C | +, B) for simplicity of notation. Here B denotes relevant background information that I may have. For example, B could include things like "I have been strictly quarantining for the past 3 weeks" etc.

We can attempt to calculate P(C | +, B) as:

P(C | +, B) = P(C, + | B) / P(+ | B) = P(+ | C, B) P(C | B) / [ P(+ | C, B) P(C | B) + P(+ | C^c, B) P(C^c | B) ]

where we used the product rule and sum rule of probability. Here C^c denotes the proposition that I do not have Covid.

In order to proceed further, we need some probability assignment. Consider the following
assignment:

P(C | B) = 0.02, P(+ | C, B) = P(+ | C) = 0.99, P(+ | C^c, B) = P(+ | C^c) = 0.04. (1)

P(C | B) represents the probability of Covid based on background information alone. The
fact that it is low (0.02) is meaningful when I know that I have been largely isolating myself for the past few weeks. With this assignment, we can calculate the required probability P(C | +, B) as follows:

P(C | +, B) = P(+ | C, B) P(C | B) / [ P(+ | C, B) P(C | B) + P(+ | C^c, B) P(C^c | B) ]
= (0.99 × 0.02) / (0.99 × 0.02 + 0.04 × 0.98) = 0.3356.
Note that 0.3356 (33.56%) is not very high even though the test has very good false positive and false negative rates. This is because P(C | B) (which can be interpreted as the probability of having Covid without taking into account the test result) is very low (0.02).
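This calculation is simple enough to verify numerically. Here is a minimal Python sketch of the Bayes rule computation above; the variable names (prior, sens, fpr) are ours, not from the notes:

```python
# Bayes rule for the Covid testing example, using the assignment (1).
prior = 0.02  # P(C | B): probability of Covid from background information alone
sens = 0.99   # P(+ | C, B): probability of a positive test given Covid
fpr = 0.04    # P(+ | C^c, B): probability of a (false) positive without Covid

# P(+ | B) by the sum rule, then P(C | +, B) by the product rule.
p_positive = sens * prior + fpr * (1 - prior)
posterior = sens * prior / p_positive
print(posterior)  # approximately 0.3356
```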

Here is an alternative method of reasoning in this problem. We formulate this as a hypothesis testing problem with

H_0 : I do not have Covid versus H_1 : I have Covid

The p-value in the above testing problem equals:

P{+ | H_0} = P(+ | C^c) = 0.04.

Usage of the naive cutoff 0.05 on the p-value would now lead to rejection of the null hypothesis and declaring that I have Covid. On the other hand, the previous argument (based on probability theory) gave a much higher probability to me not having Covid. This p-value based method does not even make use of the information given on P(C | B) and P(+ | C, B). It only makes use of P(+ | C^c). Note that what we are after is P(C^c | +) (or P(C | +)). In general, P(A|B) and P(B|A) can be quite different. Consider, for example, the case where A represents the event that a person is dead and B represents the event that they were hanged. It is therefore quite problematic to claim that one can say something about C|+ or C^c|+ from knowledge of P(+|C^c) alone.

Methods such as testing based on p-values (and putting arbitrary cutoffs on them) are not
based on probability theory. The use of p-values has been linked to serious issues such as
lack of reproducibility. In this context, we can calculate the probability of reproducibility of
the positive test:

P(+_2 | +_1, B) = P(+_2 | C, +_1, B) P(C | +_1, B) + P(+_2 | C^c, +_1, B) P(C^c | +_1, B).

Here +_2 denotes the proposition that the second test results in a positive (and +_1 denotes the proposition that the first test resulted in a positive). We now make the following probability assignment:

P(+_2 | C, +_1, B) = P(+_2 | C) = 0.99 and P(+_2 | C^c, +_1, B) = P(+_2 | C^c) = 0.04.

This assumption means that conditional on my Covid status, the two tests are independent. Using this assignment, it is straightforward to calculate the reproducibility probability as follows (note that we already calculated P(C | +_1, B) = 1 − P(C^c | +_1, B) = 0.3356):

P(+_2 | +_1, B) = 0.99 × 0.3356 + 0.04 × (1 − 0.3356) = 0.35882.

Thus this positive test will be reproducible with probability only 35.88%.
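The reproducibility probability can be checked the same way. A minimal sketch reusing the numbers above:

```python
# P(+2 | +1, B): average the second test's positive rate over the posterior
# Covid status computed after the first positive test.
sens, fpr, posterior = 0.99, 0.04, 0.3356  # values from the text above
p_repro = sens * posterior + fpr * (1 - posterior)
print(p_repro)  # 0.35882
```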

1.5 Example 2: Spots on a patient

Problem 1.2 (From the book "A tutorial introduction to Bayesian Analysis" by James Stone). Suppose you are a doctor confronted with a patient who is covered in spots. This patient's symptoms are consistent with chickenpox but they are also consistent with another, more dangerous, disease, smallpox. How would you decide if they have chickenpox or smallpox?

This is again a situation involving uncertainty as the doctor does not know which disease
the patient has. The doctor needs to calculate the probability:

P{smallpox | spots, B} (2)

where B again represents background information. For example, B could represent any
other symptoms that the patient has such as fever. Here is one probability assignment which
allows us to calculate this probability:

P{spots | smallpox, B} = 0.9, P{spots | chickenpox, B} = 0.8, P{spots | neither, B} = 0 (3)

and

P{smallpox | B} = 0.001, P{chickenpox | B} = 0.1, P{neither | B} = 0.899. (4)

Here “neither” refers to an underlying cause for the patient’s condition that is neither small-
pox nor chickenpox.

Using this assignment, the required probability (2) can be calculated via Bayes rule and
this leads to

P{smallpox | spots, B} ≈ 0.011, P{chickenpox | spots, B} ≈ 0.988, P{neither | spots, B} = 0.

So probability theory with the assignment (3) and (4) says that it is highly likely that the
patient has chickenpox (smallpox is basically ruled out because it is extremely rare).
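As a quick check of this Bayes rule calculation, here is a minimal Python sketch under the assignments (3) and (4); the dictionary layout is our own illustration:

```python
# Posterior over the three possible causes given the observed spots.
likelihood = {"smallpox": 0.9, "chickenpox": 0.8, "neither": 0.0}  # P(spots | cause, B)
prior = {"smallpox": 0.001, "chickenpox": 0.1, "neither": 0.899}   # P(cause | B)

evidence = sum(likelihood[c] * prior[c] for c in prior)  # P(spots | B)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)  # smallpox ~ 0.011, chickenpox ~ 0.989, neither = 0.0
```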

Here is an alternative way of solving this problem using maximum likelihood estimation.
The maximum likelihood estimate in this case is smallpox because smallpox leads to a higher
probability (0.9) of the observed data (spots) compared to chickenpox (0.8). Maximum
Likelihood (widely used in statistics) is not based on probability theory and also seems to
be based on the wrong conditional probabilities P{spots|smallpox} and P{spots|chickenpox}
while we really should be calculating P{smallpox|spots} and P{chickenpox|spots}.

2 Lecture Two

We shall continue our discussion of the scope of probability theory with more examples.

2.1 Example 3: Prisoner’s dilemma

The following is a standard problem (see, for example, Mosteller [4, Problem 13]).

Example 2.1 (From Mosteller’s book (Problem 13; The Prisoner’s Dilemma)). Three pris-
oners, A, B, and C, with apparently equally good records have applied for parole. The parole
board has decided to release two of the three, and the prisoners know this but not which two.
The prisoner A has a friend who is the warder of the prison and who knows which prisoners
will be released. Prisoner A realizes that it would be unethical to ask the warder if he, A, is
to be released, but decides to ask for the name of one prisoner other than himself who is to
be released. The warder says “B will be released”. What are the chances of A being released?

We need to calculate
P {A will be released | Warder says B will be released} .
By the product rule of probability, the above probability is the same as

P {A and B will be released} / P {Warder says B will be released}.
To calculate the numerator, it is natural to make the assignment
P {A and B will be released} = P {B and C will be released} = P {A and C will be released} = 1/3.
For the denominator, we can split as
P {Warder says B will be released}
= P {Warder says B will be released | A and B will be released} P {A and B will be released}
+ P {Warder says B will be released | B and C will be released} P {B and C will be released}
+ P {Warder says B will be released | A and C will be released} P {A and C will be released}
= 1 × (1/3) + P {Warder says B will be released | B and C will be released} × (1/3) + 0 × (1/3)
= 1/3 + (1/3) × P {Warder says B will be released | B and C will be released}.
To proceed further, we need a probability assignment for
P {Warder says B will be released | B and C will be released}
It is natural to assume that this probability equals 0.5. This means that, in the event that
B and C are the two prisoners who will be released, the warder is equally likely to reveal
the name of B or C to A. Under this assumption, we have
P {Warder says B will be released} = 1/3 + (1/3) × (1/2) = 1/2
leading to
P {A will be released | Warder says B will be released} = (1/3) / (1/2) = 2/3.
Note that this means that prisoner A’s chances of being released remain the same in spite
of the additional information revealed by the warder.

Here is an interesting wrinkle on this problem. Suppose that the conversation between the
prisoner A and the Warder was overheard by prisoner C who then proceeds to calculate his
own chances of being released in light of the additional information.
P {C will be released | Warder says B will be released}
P {C will be released, Warder says B will be released}
=
P {Warder says B will be released}
The denominator is the same as before so it will be 1/2. For the numerator, note that C
will be released either with A or with B. The case where C and A will be released is ruled
out because the Warder said “B will be released”. So the numerator equals:
P {C and B will be released, Warder says B will be released}
= P {Warder says B will be released | B and C will be released} P {B and C will be released}
= (1/2) × (1/3) = 1/6.
Therefore the probability of C’s release given the additional information is (1/6)/(1/2) = 1/3.
The additional information therefore significantly reduces the chances of C’s release (from
2/3 to 1/3).
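Both conditional probabilities can also be checked by simulation under the stated assignments (the released pair is uniform over the three pairs, and the warder names B or C equally often when both will be released). A minimal sketch:

```python
import random

trials = 200_000
says_b = a_released = c_released = 0

for _ in range(trials):
    released = random.choice([{"A", "B"}, {"B", "C"}, {"A", "C"}])
    # The warder names a released prisoner other than A,
    # choosing fairly when both B and C will be released.
    named = random.choice(sorted(released - {"A"}))
    if named == "B":
        says_b += 1
        a_released += "A" in released
        c_released += "C" in released

print(a_released / says_b)  # approximately 2/3
print(c_released / says_b)  # approximately 1/3
```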

2.2 Example 4: Monty Hall Problems

Example 2.2 (Monty Hall Problem). Suppose you’re on a game show, and you’re given the
choice of three doors: Behind one door is a car; behind the others, goats. You pick a door,
say No. 1, and the host, who knows what’s behind the doors, opens another door, say No.
3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your
advantage to switch your choice?

Let us suppose that I always pick Door 1 to start the game. We then need to calculate
the conditional probability:

P {car in door 2 | host opened door 3}

which we can write as

P {host door 3 | car door 2} P {car door 2} / [ P {host door 3 | car door 2} P {car door 2} + P {host door 3 | car door 1} P {car door 1} ].

We now make the following natural probability assignment:

P {host door 3 | car door 2} = 1 and P {host door 3 | car door 1} = 1/2

and also

P {car door 1} = P {car door 2} = 1/3.

This leads to

P {car door 2 | host door 3} = (1 × (1/3)) / (1 × (1/3) + (1/2) × (1/3)) = 2/3

and since this probability is more than 0.5, it makes sense for me to switch to door 2 from
my original selection of door 1.
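This answer is easy to corroborate by simulation, assuming (as in the probability assignment above) that the host always opens a goat door other than my pick, choosing at random when two goat doors are available. A minimal sketch:

```python
import random

trials = 200_000
switch_wins = stay_wins = 0

for _ in range(trials):
    car = random.choice([1, 2, 3])
    pick = 1  # we always start with door 1, as in the text
    # The host opens a goat door that is not our pick.
    opened = random.choice([d for d in (1, 2, 3) if d != pick and d != car])
    switched = next(d for d in (1, 2, 3) if d not in (pick, opened))
    stay_wins += (pick == car)
    switch_wins += (switched == car)

print(switch_wins / trials)  # approximately 2/3
print(stay_wins / trials)    # approximately 1/3
```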

2.3 Example 5: MacKay sequence example

The following application of probability theory can be seen as a way of formalizing common
sense. Probability theory has been described by some (for example, Laplace) as an extension
of common sense. Here is a quote by Laplace on this: It is seen in this essay that the theory
of Probabilities is at bottom only common sense reduced to calculus; it makes us appreciate
with exactitude that which exact minds feel by a sort of instinct without being able ofttimes
to give a reason for it.–Laplace.

The application given below is from the book MacKay [5, Chapter 28].

Problem 2.3. Find the next number in the sequence: −1, 3, 7, 11.

Note that this is a problem of reasoning under uncertainty as we are uncertain about the
way this sequence of numbers has been generated. One way of using probability theory to
solve this problem is the following. We can have two models for the number generation
mechanism here:

1. Model 1: Arithmetic Progression, i.e., a_1 = α and a_{n+1} = a_n + β.

2. Model 2: Random

Most people would look at the sequence and guess the next number as 15. In other words,
they are using Model 1 (Arithmetic Progression). Probability theory can be used to justify
this. We need to calculate

P {Model i | data} for i = 1, 2.

What probability assignments would we need to calculate the above? We can use the Bayes
Rule to write
P {Model i|data} = P {data|Model i} P{Model i} / ( P {data|Model 1} P{Model 1} + P {data|Model 2} P{Model 2} ) (5)
To be fair to each of the two models, we shall take
P{Model i} = 1/2 for each i = 1, 2 (6)
We now need to calculate P {data|Model i} for i = 1, 2. For i = 1, we have (below α and β
are the parameters in Model 1):

P {data|Model 1} = P{α = −1, β = 4}

To calculate the above, we need to make a probability assignment for the probability with which α and β take various values. MacKay [5, Chapter 28] assumes that α and β are integer-valued and that they are independently uniformly distributed over the set {−50, −49, . . . , 49, 50} which has cardinality 101. Then

P {data|Model 1} = P{α = −1, β = 4} = P{α = −1}P{β = 4} = (1/101)^2 ≈ 9.8 × 10^{−5}. (7)
For the second model, we need to specify what we mean by "random". We shall take this to mean that a_1, a_2, a_3, a_4 are independently distributed according to the uniform distribution on {−50, −49, . . . , 49, 50}. Then

P {data|Model 2} = P {a_1 = −1, a_2 = 3, a_3 = 7, a_4 = 11} = P{a_1 = −1}P{a_2 = 3}P{a_3 = 7}P{a_4 = 11} = (1/101)^4 ≈ 9.6 × 10^{−9}.

Plugging in the above value (as well as (6) and (7)) in (5), we get

P {Model 1|data} ≈ (101^{−2} × 0.5) / (101^{−2} × 0.5 + 101^{−4} × 0.5) = 0.999902 and P {Model 2|data} ≈ 0.000098.
This analysis clearly favors Model 1 compared to Model 2. The most interesting feature
about this analysis is that

P {Model 1|data} ≫ P {Model 2|data}

even though
P{Model 1} = P{Model 2}.
In other words, we did not dogmatically assert that the data was generated by an arithmetic
progression but we gave a fair chance to the two models to explain the observed sequence.

In this example, some people argue in favor of Model 1 on the basis that Model 1 is
“simpler” than Model 2. Our analysis above (based on probability theory) does not invoke

any vague notion of simplicity but does some formal calculations which in this case preferred
Model 1 to Model 2. In another situation, Model 2 may well be the preferred model.

One can consider other alternative models in this problem. For example, MacKay [5,
Chapter 28] considered the following cubic model:

Model 3 (Cubic): These numbers were generated by the formula: a_1 = a and a_{n+1} = b·a_n^3 + c·a_n^2 + d for an integer a and rational numbers b, c, d.

This cubic model explains the given data perfectly if and only if its four parameters a, b, c, d
are chosen as a = −1, b = −1/11, c = 9/11, d = 23/11. As a result,
P {data|Model 3} = P{a = −1, b = −1/11, c = 9/11, d = 23/11}.
In order to explicitly calculate the above, we need to make probability assignments for
a, b, c, d. MacKay [5] makes the following probability assignment: we assume that these four
parameters are independent with a being uniform on {−50, −49, . . . , 49, 50} and b, c, d having
the distribution of x/y where x ∼ Unif{−50, −49, . . . , 49, 50} and y ∼ Unif{1, . . . , 50} are
independent. Under this assignment:
P{a = −1, b = −1/11, c = 9/11, d = 23/11} = P{a = −1}P{b = −1/11}P{c = 9/11}P{d = 23/11}
= (1/101) × (4 × (1/101) × (1/50)) × (4 × (1/101) × (1/50)) × (2 × (1/101) × (1/50))
≈ 2.5 × 10^{−12}.
In the above, we used P(b = −1/11) = 4·(1/101)·(1/50) because −1/11 = −2/22 = −3/33 =
−4/44 and each of these has the probability (1/101) · (1/50). A similar reasoning is used for
P{c = 9/11} and P{d = 23/11}.

If Model 1, 2, 3 are the only three models considered, the Bayes rule (5) becomes

P {Model i|data} = P {data|Model i} P{Model i} / P {data}

where the denominator should be calculated as:

P {data} = Σ_{i=1}^{3} P {data|Model i} P{Model i}.

Under the fair assumption

P{Model i} = 1/3 for each i = 1, 2, 3,

we obtain

P{Model 1|data} = (101^{−2} × (1/3)) / (101^{−2} × (1/3) + 101^{−4} × (1/3) + 2.5 × 10^{−12} × (1/3)) = 0.999902

and

P{Model 2|data} ≈ 9.8 × 10^{−5}

and

P{Model 3|data} ≈ 2.55 × 10^{−8}.
Our preference for Model 1 is still as strong as before (when we only considered the two
models Model 1 and Model 2).

The analysis given here depends on the specific choices of priors used for the three models.
One can of course use alternative priors but the qualitative preference for Model 1 is unlikely
to change for most reasonable prior choices.
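The three posteriors above can be reproduced in a few lines. This is a minimal sketch of the calculation, with the likelihood values taken from (7) and the displays that follow it:

```python
# Model comparison for the sequence -1, 3, 7, 11 (MacKay's example).
lik = {
    "arithmetic": (1 / 101) ** 2,    # (7): alpha = -1 and beta = 4
    "random": (1 / 101) ** 4,        # all four terms uniform on 101 values
    "cubic": 32 / (101**4 * 50**3),  # (1/101)(4/(101*50))(4/(101*50))(2/(101*50))
}
prior = {m: 1 / 3 for m in lik}  # fair prior over the three models

evidence = sum(lik[m] * prior[m] for m in lik)
posterior = {m: lik[m] * prior[m] / evidence for m in lik}
print(posterior)  # arithmetic ~ 0.9999, random ~ 9.8e-05, cubic ~ 2.5e-08
```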

2.4 Interpretation of Probability

There are multiple interpretations of probability and still some controversy regarding what
probability really is. A proper understanding of the meaning of probability is very important
for applications of probability to statistics and data analysis.

There are broadly two ways of understanding probability.

2.4.1 Frequentist/Objective Understanding of Probability

From the frequentist viewpoint, probability is applicable only in the context of “random
experiments” (such as tossing coins and rolling dice). The probability P(A) of an event A is
defined as the relative frequency that A occurs in N repeated trials of the experiment in the
limit as N → ∞:

P(A) := lim_{N→∞} n_A / N

where n_A is the number of trials out of N where A occurs. This definition of probability is the basis of frequentist statistics. Here are some examples:

1. The statement P(H) = 0.5 means that the proportion of heads in a large number of
tosses of the coin approaches 0.5.

2. The statement

   ϵ_1, . . . , ϵ_n i.i.d. ∼ N(0, σ^2)

   means that if the experiment generating ϵ_1, . . . , ϵ_n is repeated a large number of times, the proportion of times the values of (ϵ_1, . . . , ϵ_n) lie in a set A approaches

   ∫_A ∏_{i=1}^{n} (1/(√(2π) σ)) exp(−x_i^2/(2σ^2)) dx_1 · · · dx_n

   and this should be true for all subsets A of R^n.

3. The statement

   P{ θ ∈ [X̄ − 1.96 S/√n, X̄ + 1.96 S/√n] } = 0.95

   means that if we repeat the experiment generating the random variables X̄ and S a large number of times, then the proportion of times the interval

   [X̄ − 1.96 S/√n, X̄ + 1.96 S/√n]

   contains θ approaches 0.95.
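The frequentist reading of statement 3 can itself be illustrated by simulation. Below is a minimal sketch; the particulars (θ = 5, σ = 2, n = 50, and the plain 1.96 normal cutoff) are our own illustrative choices:

```python
import random
import statistics

theta, sigma, n, reps = 5.0, 2.0, 50, 20_000
covered = 0
for _ in range(reps):
    x = [random.gauss(theta, sigma) for _ in range(n)]
    xbar, s = statistics.mean(x), statistics.stdev(x)
    half = 1.96 * s / n ** 0.5
    covered += (xbar - half <= theta <= xbar + half)
print(covered / reps)  # close to 0.95 for this sample size
```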

The following are some obvious problems with the frequentist definition:

1. It is very restrictive and hardly ever applicable. In many simple situations where we
would like to use probability, the frequency definition simply does not apply:

a) Is the suspect X guilty?

b) What is the chance of rain in Berkeley today?

c) What is the chance that Y is cancer positive given that they tested positive?

2. Even in situations where the frequency definition is seemingly applicable, closer thought might reveal some issues. For example, the frequentist probability that a coin comes up heads is 0.6 means that 60% of a large number of tosses of the coin should result in heads. But the mechanics of no two tosses are really identical and if two tosses are done exactly identically, then we would expect the same outcome by the laws of physics. So the term "identical and independent repetitions of an experiment" is ambiguous.

In the frequentist definition, probability is considered an intrinsic property of the object under investigation which is only accessible by an experiment generating samples of infinite size. The frequentist probability is also referred to as "objective probability". The implication is that we cannot assign it arbitrarily because any probability assignment that does not agree with the frequency in infinite trials is wrong. Unfortunately, the actual frequentist probability is seldom known because one cannot generally observe a large number of repetitions of an experiment and so almost all probability assignments are wrong from the frequentist point of view. This is one way of understanding the statistics aphorism: "All models are wrong" (usually attributed to George Box; see https://en.wikipedia.org/wiki/All_models_are_wrong).

3 Lecture Three

3.1 Interpretation of Probability

There are multiple interpretations of probability and still some controversy regarding what
probability really is. A proper understanding of the meaning of probability is very important
for applications of probability to statistics and data analysis.

There are broadly two ways of understanding probability.

3.1.1 Frequentist/Objective Understanding of Probability

From the frequentist viewpoint, probability is applicable only in the context of “random
experiments” (such as tossing coins and rolling dice). The probability P(A) of an event A is
defined as the relative frequency that A occurs in N repeated trials of the experiment in the
limit as N → ∞:

P(A) := lim_{N→∞} n_A / N

where n_A is the number of trials out of N where A occurs. This definition of probability is the basis of frequentist statistics.

According to this definition, the statement P(H) = 0.5 means that the proportion of heads
in a large number of tosses of the coin approaches 0.5.

The following are some obvious problems with the frequentist definition:

1. It is very restrictive and hardly ever applicable. In many simple situations where we
would like to use probability, the frequency definition simply does not apply:

a) Is the suspect X guilty?

b) What is the chance of rain in Berkeley today?

c) What is the chance that Y is cancer positive given that they tested positive?

For an interesting anecdote about how this restrictive notion simply does not make sense in some important problems, see deGroot [3, pages 43-44].

2. Even in situations where the frequency definition is seemingly applicable, closer thought might reveal some issues. For example, the frequentist probability that a coin comes up heads is 0.6 means that 60% of a large number of tosses of the coin should result in heads. But the mechanics of no two tosses are really identical and if two tosses are done exactly identically, then we would expect the same outcome by the laws of physics. So the term "identical and independent repetitions of an experiment" is ambiguous.

In the frequentist definition, probability is considered an intrinsic property of the object under investigation which is only accessible by an experiment generating samples of infinite size. The frequentist probability is also referred to as "objective probability". The implication is that we cannot assign it arbitrarily because any probability assignment that does not agree with the frequency in infinite trials is wrong. Unfortunately, the actual frequentist probability is seldom known because one cannot generally observe a large number of repetitions of an experiment and so almost all probability assignments are wrong from the frequentist point of view. This is one way of understanding the statistics aphorism: "All models are wrong" (usually attributed to George Box; see https://en.wikipedia.org/wiki/All_models_are_wrong).

Here are some quotes by famous statisticians/probabilists illustrating how widespread frequentist thinking in probability is:

The numbers p_r should, in fact, be regarded as physical constants of the particular die that we are using, and the question as to their numerical values cannot be answered by the axioms of probability, any more than the size and the weight of the die are determined by the geometrical and mechanical axioms. However, experience shows that in a well-made die the frequency of any event r in any long series of throws usually approaches 1/6, and accordingly we shall often assume that all the p_r are equal to 1/6... – Cramér.

Here is Jaynes’s response to the above quote (from page 317 of his book): To a physicist,
this statement seems to show utter contempt for the known laws of mechanics. The results
of tossing a die many times do not tell us any definite number characteristic only of the
die. They tell us also something about how the die was tossed. If you toss ’loaded’ dice in
different ways, you can easily alter the relative frequencies of the faces. With only slightly
more difficulty, you can still do this if your dice are perfectly ’honest’.

Here is a quote by Feller (see page 322 of the Jaynes book) illustrating the thinking that bridge hands possess physical probabilities and that the uniform probability assignment is a convention whose correctness can only be verified by observed frequencies in a random experiment: The number of possible distributions of cards in bridge is almost 10^30. Usually we agree to consider them as equally probable. For a check of this convention more than 10^30 experiments would be required – a billion of billion of years if every living person played one game every second, day and night. – Feller.

In spite of these objections, one positive aspect of the frequentist meaning of probability
is that the Rules of Probability follow easily from this definition.

3.1.2 Subjective or Bayesian Understanding of Probability

It should be clear that in order to use probability as a general method for reasoning under uncertainty, we need a much more general understanding of probability than is allowed by the frequentist notion. Expounding this general theory is the purpose of the Jaynes book [1].

The basic idea is to first give up on an objective definition of probability and just admit
that there is no such thing as a physical probability (Jaynes [1, page 325]). Every probability
is subjective and is relative to the person who is actually reasoning under uncertainty. More
specifically, probability of an event is something that a specific individual (or a robot or
a computer) either assigns based on their state of knowledge (available information) or
calculates based on their probability assignments for related events. Generally, probability
has nothing to do with frequency (unless we are in certain special situations where frequency
information is available; we shall see examples of this later). Here is a quote by Harold
Jeffreys (who was one of the founders of this way of thinking about probability) related to
this:

The essence of the present theory is that no probability, direct, prior, or posterior, is simply
a frequency. – Jeffreys 1939.

To give a concrete example, from the Bayesian viewpoint, the statement P(H) = 0.5 will be considered to be the assignment made by some specific individual based on their background
information. It means that, based on their background information, they have no reason at
all to distinguish between H and T and thus they are totally confused about whether the
specific toss will lead to an H or a T .

It is very interesting to note that the same statement P(H) = 0.5 is an informative
objective fact in the frequentist viewpoint while it is an uninformative assignment in the
Bayesian viewpoint.

To understand the Bayesian interpretation, consider developing a spam filter that classifies
incoming emails as spam or regular (there are many statistics/machine learning methods for
doing this including, say, logistic regression or classification trees). Suppose you have a
trained spam filter and you apply it to a specific incoming email. If the filter outputs a
predicted probability of the email being spam as 0.5, you would conclude that the filter has
no idea whether this email is spam or regular. This is exactly the Bayesian viewpoint.

3.1.3 Rules of Probability

Let us now look at the rules of probability when probability is viewed from the Bayesian viewpoint, which has nothing to do with frequencies:

1. P(A) always lies between 0 and 1. The probability of an impossible event is 0 and the
probability of a certain event is 1.

2. Product rule: P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B).

3. Sum rule: P(A ∪ B) = P(A) + P(B) for disjoint events A and B.

These rules can be justified in a straightforward way if we use the frequency definition of
probability. We are now using a more general form of probability which has nothing to do with frequencies and whose values are assignments made by a specific user. What is constraining
the user to follow the above rules? The following quote by Fisher (1934) asks the same
question:

Keynes establishes the laws of addition and multiplication of probabilities, by stating these
laws in the form of definitions of the processes of addition and multiplication. The important

step of showing that, when these probabilities have numerical values, "addition" and "multiplication" are so defined, are equivalent to the arithmetical processes ordinarily known by
these names, is omitted. The omission is an interesting one, since it shows the difficulty
of establishing the laws of mathematical probability, without basing the notion of probability
on the concept of frequency, for which these laws are really true, and from which they were
originally derived.

It is important to be able to justify the rules of probability for this general form of prob-
ability. Otherwise, there will be no principled way of computing probabilities of things we
really care about and the whole business will be quite arbitrary.

The following justification for the rules of probability is originally due to the physicist R.
T. Cox and is the content of Chapter 1 and Chapter 2 of Jaynes [1]. I will give a sketch of
the argument skipping some important technical details. For the full argument, please read
Jaynes [1, Chapter 1 and 2].

Let us first remove all restrictions on probabilities and even allow them to take values
outside the interval [0, 1]. To avoid confusion, let us use the term “plausibilities”. We are
assigning plausibilities of various events (or propositions) conditional on other events. Let
us denote the plausibility of event A conditional on event B by (A|B). Let us first make the
assumption that plausibilities take values in the set of real numbers (no restriction now to
be in the interval [0, 1]) and that a higher value of plausibility represents a greater belief.

3.1.4 Product Rule

Let us first investigate why the product rule should be true. The product rule in terms of
probabilities states that
P(AB|C) = P(B|C)P(A|BC)
Here AB denotes the event A ∩ B. Should our plausibilities satisfy a similar identity?
Let us first assume that the plausibility (AB|C) should really be determined by the two
plausibilities (B|C) and (A|BC). This is basically because the process of deciding that AB
is true can be broken down into first deciding whether B is true and then, having accepted
B as true, deciding whether A is true. We shall therefore assume that there should be a
function F such that
(AB|C) = F ((B|C), (A|BC)).
We also assume that we should use the same function F for all possible events A, B, C (i.e.,
we are not using one function F for some A, B, C while calculating (AB|C) from (B|C) and
(A|BC) and using another function F for different A, B, C). This means in particular that

(AB|C) = F ((A|C), (B|AC)).

It is also reasonable to assume that F (x, y) is monotone increasing in each of its arguments
and that it is continuous. If it is not continuous, then a small change in (B|C) (or (A|BC))
might lead to a large change in (AB|C) which is undesirable.

Now if we have four events A, B, C, D, we can write

(ABC|D) = F ((BC|D), (A|BCD)) = F (F ((C|D), (B|CD)), (A|BCD)).

We can also write

(ABC|D) = F ((C|D), (AB|CD)) = F ((C|D), F ((B|CD), (A|BCD))).

We shall now make the following important consistency assumption: If a plausibility can
be calculated via two different methods, then both methods should give the
same answer. Clearly if this assumption were violated, then our answer to a plausibility
calculation would depend on the specific method chosen to calculate and this would be highly
undesirable. This assumption immediately implies that

F (F ((C|D), (B|CD)), (A|BCD)) = F ((C|D), F ((B|CD), (A|BCD))).

for all A, B, C, D. If the individual plausibilities are arbitrary, we would get the following
condition that the function F should satisfy

F (F (x, y), z) = F (x, F (y, z)) for all real numbers x, y, z.

It now turns out that the only functions F which satisfy the above equation are of the form

F(x, y) = w^{-1}(w(x)w(y))

for a positive continuous increasing function w. I will skip this derivation (see Section 2.1,
Chapter 2 of Jaynes [1]). We thus have

(AB|C) = F ((B|C), (A|BC)) = w^{-1}(w(B|C)w(A|BC)). (8)

This is equivalent to
w(AB|C) = w(B|C)w(A|BC).
Now if we take B = A, we get

w(A|C) = w(A|C)w(A|AC)

The event A|AC can be seen as certainty so we get

w(A|C) = w(A|C)w(certainty)

for all A and C. This can happen only if

w(certainty) = 1. (9)

Also if we take B = A^c in (8), we get

w(AA^c|C) = w(A^c|C)w(A|A^cC)

AA^c|C and A|A^cC can both be taken to represent impossibility so we get

w(impossible) = w(A^c|C)w(impossible)

for all A and C which gives

w(impossible) = 0. (10)
(9) and (10), along with the monotonicity of w, imply

0 ≤ w(A|B) ≤ 1 for all A and B. (11)

We have thus proved that w(A|B) lies always between 0 and 1 (is 0 for impossibility and 1
for certainty) and it satisfies the product rule of probability:

w(AB|C) = w(A|C)w(B|AC) = w(B|C)w(A|BC). (12)

In other words, if we apply this function w to our plausibilities, then the resulting
assignments satisfy the first two rules of probability.
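One direction of this characterization is easy to check numerically: any F of the form w^{-1}(w(x)w(y)) with a positive increasing w is associative. A small sketch with one arbitrary choice of w (ours, purely for illustration):

```python
import math
import random

# Take w(x) = exp(x), an arbitrary positive increasing function; then
# F(x, y) = w^{-1}(w(x) w(y)), computed generically below.
w, w_inv = math.exp, math.log

def F(x, y):
    return w_inv(w(x) * w(y))

for _ in range(1000):
    x, y, z = (random.uniform(-2, 2) for _ in range(3))
    assert abs(F(F(x, y), z) - F(x, F(y, z))) < 1e-9  # associativity holds
print("F(F(x, y), z) == F(x, F(y, z)) verified at random points")
```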

3.1.5 Sum Rule

Below we shall sketch the argument for the sum rule of probability (the full details can be
found in Section 2.2 of Jaynes [1]). For a proposition A, we denote its complement by A^c (i.e., A^c refers to the proposition that A is not true). Suppose that the plausibility w(A^c|C) should be a function of w(A|C):

w(A^c|C) = S(w(A|C)).

This is intuitively meaningful as the plausibility of A^c should be determined by the plausibility of A. We also assume that we use the same function S for every A, C. This function S maps [0, 1] to [0, 1] and it should be a self-reciprocal function because S(S(w(A|C))) = S(w(A^c|C)) = w(A|C), i.e., S(S(x)) = x or S^{-1}(x) = S(x). It should also satisfy (by taking A to be certainty) S(1) = 0.

There is another condition that S needs to satisfy as a consequence of the fact that w(A|B)
satisfies the product rule. For three propositions A, B, C, we have
w(AB|C) = w(A|C)w(B|AC) = w(A|C)S(w(B^c|AC)) = w(A|C)S( w(AB^c|C) / w(A|C) ).

Switching A and B, we get

w(AB|C) = w(B|C)S( w(A^cB|C) / w(B|C) ).

We thus have

w(A|C)S( w(AB^c|C) / w(A|C) ) = w(B|C)S( w(A^cB|C) / w(B|C) )

for all A, B, C.

Now suppose A and B are such that B^c is contained in A (or equivalently A^c is contained in B). This means that whenever B^c is true, A is also true. So we have AB^c = B^c and A^cB = A^c. Thus

w(A|C)S( w(B^c|C) / w(A|C) ) = w(B|C)S( w(A^c|C) / w(B|C) )

which is equivalent to

w(A|C)S( S(w(B|C)) / w(A|C) ) = w(B|C)S( S(w(A|C)) / w(B|C) ).

The above equation should be true for all A, B, C such that B^c is contained in A. Letting x = w(A|C), y = w(B|C) so that S(y) = w(B^c|C) ≤ w(A|C) = x, we thus have

x S(S(y)/x) = y S(S(x)/y) for all 0 ≤ x, y ≤ 1 with 0 ≤ S(y) ≤ x.
One can then show that the above condition implies that

S(x) = (1 − x^α)^{1/α} for x ∈ [0, 1]

for some α > 0. This argument is somewhat technical and you can read it in Jaynes [1,
Section 2.2].
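Although the derivation is technical, its conclusion can be sanity-checked numerically: S(x) = (1 − x^α)^{1/α} does satisfy the functional equation above. A minimal sketch (the choice α = 2.5 is arbitrary and ours):

```python
import random

alpha = 2.5  # any alpha > 0 works

def S(x):
    return (1 - x ** alpha) ** (1 / alpha)

for _ in range(1000):
    x, y = random.random(), random.random()
    if S(y) < x - 1e-6:  # stay inside the constraint 0 <= S(y) <= x
        assert abs(x * S(S(y) / x) - y * S(S(x) / y)) < 1e-9
print("x S(S(y)/x) == y S(S(x)/y) verified at random points")
```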

We have thus proved that

w(A^c|C) = (1 − w^α(A|C))^{1/α}

which is equivalent to

w^α(A^c|C) = 1 − w^α(A|C), or w^α(A^c|C) + w^α(A|C) = 1.

It can now be noted that the rules (9), (10), (11) and (12) that w(A|B) satisfies are also satisfied by w^α(A|B). Thus w^α(A|B) satisfies all the three rules

1. 0 ≤ w^α(A|B) ≤ 1, w^α(impossible) = 0, and w^α(certain) = 1,

2. w^α(AB|C) = w^α(A|C)w^α(B|AC) = w^α(B|C)w^α(A|BC), and

3. w^α(A^c|C) + w^α(A|C) = 1.

We shall therefore denote w^α by P and call it probability. P then satisfies the rules:

1. 0 ≤ P(A|B) ≤ 1, P(impossible) = 0, and P(certain) = 1,

2. Product Rule: P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC), and

3. Sum Rule: P(A^c|C) + P(A|C) = 1.

Note that these usual laws of probability hold because of the need for logical consistency
and not because our probability has anything to do with frequencies.

The sum rule of probability is usually stated as

P(A ∪ B|C) = P(A|C) + P(B|C) − P(AB|C) (13)

where A ∪ B denotes the proposition "at least one of A and B is true" (Jaynes uses the notation A + B for A ∪ B). (13) can be derived from the stated rules as

P(A ∪ B|C) = 1 − P((A ∪ B)^c|C)
= 1 − P(A^c ∩ B^c|C)
= 1 − P(B^c|A^cC)P(A^c|C)
= 1 − (1 − P(B|A^cC))P(A^c|C)
= 1 − P(A^c|C) + P(B|A^cC)P(A^c|C)
= P(A|C) + P(A^cB|C)
= P(A|C) + P(B|C)P(A^c|BC)
= P(A|C) + P(B|C)(1 − P(A|BC))
= P(A|C) + P(B|C) − P(B|C)P(A|BC) = P(A|C) + P(B|C) − P(AB|C).

4 Lecture Four

4.1 Recap: Derivation of the Rules of Probability for Subjective Probability

In the last class, we sketched the derivation of the rules of probability from logical consistency
without relying on any imagined frequency considerations for defining probability. The
argument (taken from Jaynes [1, Chapter 2]) proceeded in the following way.

We denoted the "plausibility" of a proposition A given some information in the form of proposition B by (A | B). We assumed that these plausibilities take values in the set of real numbers (no restriction to be in the interval [0, 1]) and that a higher value of plausibility represents greater belief.

To deduce the usual product rule of probability, we assumed the existence of a continuous
coordinate-wise increasing function F of two real variables such that
(AB | C) = F ((B | C), (A | BC))
for all A, B, C. This function can then be employed in two different orders to calculate
(ABC | D) for four propositions A, B, C, D:
(ABC | D) = F (F ((C|D), (B|CD)), (A|BCD))
(ABC | D) = F ((C|D), F ((B|CD), (A|BCD))).
It is therefore natural to assume that the function F should be such that the right hand sides
of the above two equations produce the same answer. From this, one gets the condition:
F (F (x, y), z) = F (x, F (y, z)) for all real numbers x, y, z.
As proved in Jaynes [1, Section 2.1], the only functions F which satisfy the above equation
are of the form
F (x, y) = w^{-1}(w(x)w(y))
for a positive continuous increasing function w.

From here, we derived the following:

1. w(AB | C) = w(A | C)w(B | AC) = w(B | C)w(A | BC).

2. w(A | C) always lies between 0 and 1 with w(impossible) = 0 and w(certain) = 1.

In other words, the function w applied to the plausibilities leads to quantities which satisfy
the first two rules of probability.

Next the goal is to derive the sum rule. Here we first assume that there exists a function
S : [0, 1] → [0, 1] such that
w(A^c | C) = S(w(A | C))
for all A and C. Note that we are working with w(A | C) instead of the raw plausibilities
(A | C). This allows us to use the product rule which has already been derived. To set up
the characterizing equation for S(·), consider the setting of Figure 1.

In this setting, there are two different ways of calculating the plausibility of the proposition
R = AB in terms of x := w(A) and y := w(B) and the function S. Both these calculations
use the product rule. The first method for calculating w(R) = w(AB) is:
w(AB) = w(A)w(B | A)
= w(A)S(w(B^c | A))
= w(A)S( w(B^c) / w(A) )
= w(A)S( S(w(B)) / w(A) ) = x S(S(y)/x).

The second method for calculating w(R) = w(AB) simply switches the roles of A and B in the first method:

w(AB) = w(B)w(A | B)
= w(B)S(w(A^c | B))
= w(B)S( w(A^c) / w(B) )
= w(B)S( S(w(A)) / w(B) ) = y S(S(x)/y).

Figure 1: Setting for deriving the Sum Rule

It is therefore natural to assume that S satisfies:

x S(S(y)/x) = y S(S(x)/y).
Recall that here x = w(A) and y = w(B). The setting is such that x and y cannot be
completely arbitrary. Indeed because B^c ⊆ A, we must have

w(B^c) ≤ w(A), or equivalently S(y) ≤ x.

Our condition on S is therefore

x S(S(y)/x) = y S(S(x)/y) for all 0 ≤ x, y ≤ 1 with 0 ≤ S(y) ≤ x.
It is now proved in Jaynes [1, Section 2.2] that the above condition implies that

S(x) = (1 − x^α)^{1/α} for x ∈ [0, 1]

for some α > 0.

We have thus proved that

w(A^c|C) = (1 − w^α(A|C))^{1/α}

which is equivalent to

w^α(A^c|C) = 1 − w^α(A|C), or w^α(A^c|C) + w^α(A|C) = 1.

It can now be noted that the first two rules that are satisfied by w(A|B) are also satisfied by w^α(A|B). Thus w^α(A|B) satisfies all the three rules

1. 0 ≤ w^α(A|B) ≤ 1, w^α(impossible) = 0, and w^α(certain) = 1,

2. w^α(AB|C) = w^α(A|C)w^α(B|AC) = w^α(B|C)w^α(A|BC), and

3. w^α(A^c|C) + w^α(A|C) = 1.

Denoting w^α by P, we have

1. 0 ≤ P(A|B) ≤ 1, P(impossible) = 0, and P(certain) = 1,

2. Product Rule: P(AB|C) = P(A|C)P(B|AC) = P(B|C)P(A|BC), and

3. Sum Rule: P(A^c|C) + P(A|C) = 1.

We have thus derived the rules of probability without invoking any relationship between
probability and long-run frequency.

To summarize: If we argue in terms of plausibilities but we take care to ensure some natural constraints for logical consistency, then we cannot manipulate the plausibilities arbitrarily but have to reason according to the usual rules of probability after an appropriate transformation (this transformation is given by the function w^α).

The sum rule of probability is usually stated as

P(A ∪ B|C) = P(A|C) + P(B|C) − P(AB|C) (14)

where A ∪ B denotes the proposition “at least one of A and B is true” (Jaynes uses the
notation A + B for A ∪ B). (14) can be derived from the stated rules as

P(A ∪ B|C) = 1 − P((A ∪ B)^c|C)
= 1 − P(A^c ∩ B^c|C)
= 1 − P(B^c|A^cC)P(A^c|C)
= 1 − (1 − P(B|A^cC))P(A^c|C)
= 1 − P(A^c|C) + P(B|A^cC)P(A^c|C)
= P(A|C) + P(A^cB|C)
= P(A|C) + P(B|C)P(A^c|BC)
= P(A|C) + P(B|C)(1 − P(A|BC))
= P(A|C) + P(B|C) − P(B|C)P(A|BC) = P(A|C) + P(B|C) − P(AB|C).

4.2 Probability Assignment

As we discussed in Lecture One, probability theory works according to the two steps: (a)
probability assignment, and (b) calculation. The calculation in the second step is based on
the rules of probability, and we have just seen the rationale behind these rules.

We shall now make some general comments on the probability assignment step. The
most important word here is “Information”. Probability assignment is always made by
some specific individual based on available or assumed information. The term “assumed
information” is important here because, quite often, certain aspects of available information
might be hard to precisely quantify, so one may choose to ignore such aspects in order to work
with a simpler probability assignment.

To make this first step of probability theory “rigorous”, we need precise rules for trans-
forming available/assumed information into probability assignments. The most fundamental

of these rules is known as the Principle of Indifference or the Principle of Insufficient
Reason and it states the following:

If, on background information B, the propositions A_1, . . . , A_N are mutually exclusive and exhaustive, and B does not favor any one of them over any other, then

P(A_i | B) = 1/N for i = 1, . . . , N. (15)

In other words, the Principle of Indifference states that if the background information B
is "symmetric" among A_1, . . . , A_N, then one should use the probability assignment (15).

For a concrete example illustrating the Principle of Indifference, consider the following setting. Suppose we want to assign a probability for a coin toss landing H. Let us consider the following three kinds of information:

1. Information I_1: We don't know anything at all about the coin. We don't even know if it really has two sides H and T or both of its sides are of only one kind (either H or T). In addition, we don't know how exactly it will be tossed.

2. Information I_2: We know that it is a "regular" coin and that it has two sides H and T but we don't know how it will be tossed.

3. Information I_3: We know that this coin has been tossed a large number of times in the past and it landed heads 70% of the time.

In the first case, the Principle of Indifference applies and we shall assign

P(H|I1 ) = 0.5 (16)

It is very important to recognize here that (16) is not a statement of frequency. Specifically
(16) does not mean that if we toss the coin a large number of times, it will land heads 50%
of the time. All we are saying is that we assign the probability of 0.5 for the coin landing
heads in this one specific toss that we are reasoning about. We simply do not have any
information to make speculations about the behaviour of a large number of tosses. In fact,
our information does not even tell us that the coin indeed has the two sides H and T . So
it might well be the case that the coin tosses will result in HHHHH . . . or T T T T T . . . . But
we shall still assign (16) because our current information I1 does not allow us to distinguish
between H and T .

Now let us come to I2 which is more informative than I1 . However, even here, there is
nothing to distinguish between H and T . So the Principle of Indifference applies again and
we assign
P(H|I2 ) = 0.5. (17)
Again this has nothing to do with frequency. Even though it is a regular coin, it might be
tossed in a way to produce more heads than tails. So if an actual experiment were performed,
then depending on how the coin tosses were performed, the frequency of heads can be pretty
much anything between 0 and 1. Because our probability has nothing to do with frequency,
we shall simply assign (17) as our information I2 is symmetric in H and T . Note that if an
experiment is actually performed and the proportion of heads turns out to be different from
0.5, this does not contradict (17) at all. Because (17) is an assignment capturing our state
of knowledge I2 and is perfectly sensible under I2 .

Now let us come to I3 . Here the Principle of Indifference obviously does not apply. If
we were using the frequency definition, we would immediately assign P(H|I3 ) = 0.7. But

because our probability has nothing to do with frequency, we cannot jump to this assignment
immediately. The issue is that here we need to reason about not just one toss but this
imminent toss along with all the previous tosses. Specifically, in this situation I3 , we are
dealing with random variables X1 , . . . , XN corresponding to the previous large number of
tosses and the current toss XN +1 about which we are reasoning. We need to calculate:
 
P{ XN+1 = 1 | (X1 + · · · + XN)/N = 0.7 }    (18)
To calculate this, we need to make a more basic probability assignment for the joint distri-
bution of X1 , . . . , XN , XN +1 which allows us to compute P(H|I3 ). If we assume that
X1 , . . . , XN+1 ∼ i.i.d. Ber(0.5),    (19)

then XN +1 will be independent from X1 , . . . , XN so that (18) will be 0.5 i.e., the given
frequency information is irrelevant under the model. On the other hand, under the more
complicated model assumption
X1 , . . . , XN+1 | Θ = θ ∼ i.i.d. Ber(θ) and Θ ∼ Unif[0, 1],

we shall show later that the probability (18) will be very close to 0.7 when N is large.

Therefore, in the third situation, when we actually have frequency information, under the
right kind of model, our analysis will lead to the frequency assignment. Thus in this theory
of probability, frequency will appear naturally in probability assignments when frequency
information is available and relevant to the problem.

Next we shall review standard probability distributions starting with the Hypergeometric
Distribution. These standard distributions will be useful for us while making probability
assignments.

4.3 Urn Problems: Hypergeometric Distribution

Consider an urn with N balls and assume that R of the N balls are red and the remaining
W := N − R are white. Assume that the balls are identical in every other respect.

Suppose we sample n balls from the urn without replacement. What is the probability of
seeing exactly r red balls in the sample? The answer, as we shall see, is given by
\binom{R}{r} \binom{W}{w} / \binom{N}{n}

where w := n − r is the number of white balls in the sample. This requires 0 ≤ r ≤ R and
0 ≤ w ≤ W . If these conditions are not satisfied, the required probability will be zero.

Here is one way of proving this probability statement. Let Ri denote the proposition that
the ith draw results in a red ball and let Wi denote the proposition that the ith draw results
in a white ball. Let us first consider the probability

P {R1 . . . Rr Wr+1 . . . Wn } .

Using the product rule of probability we can calculate the above probability as

P(R1) [∏_{i=2}^{r} P{Ri | R1 . . . Ri−1}] P(Wr+1 | R1 . . . Rr) ∏_{j=r+2}^{n} P{Wj | R1 . . . Rr Wr+1 . . . Wj−1}.

By the principle of indifference and the sum rule of probability, we get P(R1) = R/N,

P{Ri | R1 . . . Ri−1} = (R − i + 1)/(N − i + 1)    for i = 2, . . . , r,

P(Wr+1 | R1 . . . Rr) = W/(N − r)

and

P{Wj | R1 . . . Rr Wr+1 . . . Wj−1} = (W − j + r + 1)/(N − j + 1)    for r + 2 ≤ j ≤ n.
It then follows that

P{R1 . . . Rr Wr+1 . . . Wn} = [R(R − 1) · · · (R − r + 1) W(W − 1) · · · (W − w + 1)] / [N(N − 1) · · · (N − n + 1)] = \binom{R}{r}\binom{W}{w} / [\binom{N}{n}\binom{n}{r}].

It can now be checked that, by the same argument, one also has

P{R1 W2 R3 . . . Rr+1 Wr+2 . . . Wn} = \binom{R}{r}\binom{W}{w} / [\binom{N}{n}\binom{n}{r}].

More generally, the probability of any specific sequence RW RRW . . . with exactly r reds is given by

\binom{R}{r}\binom{W}{w} / [\binom{N}{n}\binom{n}{r}].

Because the number of such sequences of R's and W's with exactly r R's is \binom{n}{r}, we obtain

P{r reds in sample of size n} = \binom{R}{r}\binom{W}{w} / \binom{N}{n}.

As a function of r, this is known as the Hypergeometric Probability Mass Function. The mean of this distribution can be checked to be nR/N. Thus the average fraction of red balls in the sample of size n will match the fraction of red balls in the urn. We can also compute the most likely value of r i.e., the mode of the hypergeometric distribution. For this, let

h(r) = \binom{R}{r}\binom{W}{w} / \binom{N}{n}

and one can easily calculate that

h(r + 1)/h(r) = [(R − r)/(r + 1)] · [(n − r)/(W − (n − r) + 1)]

so that

h(r + 1)/h(r) ≥ 1 ⟺ r + 1 ≤ (n + 1)(R + 1)/(N + 2).

This gives that

⌊(n + 1)(R + 1)/(N + 2)⌋

can be taken to be the mode of the hypergeometric distribution. When n and N are large, the quantity above is approximately nR/N so that the most likely value of r is approximately such that the sample fraction of red balls matches the fraction of red balls in the urn.
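One can check the mode formula numerically; here is a minimal Python sketch (not part of the original notes; it assumes scipy is available, and the urn numbers N = 50, R = 20, n = 10 are purely illustrative):

from scipy.stats import hypergeom

# scipy's parameterization: hypergeom.pmf(r, M, n, N) with (M, n, N) = (urn size, red balls, draws)
N, R, n = 50, 20, 10
pmf = [hypergeom.pmf(r, N, R, n) for r in range(n + 1)]
mode_numeric = max(range(n + 1), key=lambda r: pmf[r])
mode_formula = (n + 1) * (R + 1) // (N + 2)   # floor((n+1)(R+1)/(N+2))
print(mode_numeric, mode_formula)             # both equal 4, which is also nR/N here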

See Jaynes [1, Chapter 3] for more calculations in this urn setting.

5 Lecture Five

5.1 The Hypergeometric Distribution

In the last class, we studied the Hypergeometric Distribution in the following urn setting.
There is an urn with N balls of which R are red and the remaining W := N − R are white.
Assume that the balls are identical in every other respect. We then sample n balls from
the urn without replacement. What is the probability of seeing exactly r red balls in the
sample? We saw that the answer is given by:
\binom{R}{r}\binom{W}{w} / \binom{N}{n}

where w := n − r is the number of white balls in the sample.

We can let X be the random variable denoting the number of red balls in the drawn sample of size n. Then

P{X = r} = \binom{R}{r}\binom{W}{w} / \binom{N}{n}.    (20)
This X is said to have the Hypergeometric Distribution with parameters N, R, n.

Let us go over the proof of (20) using notation that is different from last time. For each
i = 1, . . . , n, let Xi denote the binary random variable that equals 1 if the ith draw results
in a red ball and equals 0 if the ith draw results in a white ball. Then it is easy to see that

X = X1 + · · · + Xn .

By the argument given at the end of the last lecture, we have (Σ_{i=1}^{n} x_i below plays the role of r in (20)):

P{X1 = x1 , . . . , Xn = xn} = \binom{R}{Σ_{i=1}^{n} x_i} \binom{W}{n − Σ_{i=1}^{n} x_i} / [\binom{N}{n} \binom{n}{Σ_{i=1}^{n} x_i}].    (21)

An important feature of (21) is that the right hand side depends on the individual x1 , . . . , xn only through their sum Σ_{i=1}^{n} x_i. This implies that the distribution of X1 , . . . , Xn is the same as Xπ1 , . . . , Xπn for every permutation π1 , . . . , πn of 1, . . . , n:

P{Xπ1 = x1 , . . . , Xπn = xn} = P{X1 = x1 , . . . , Xn = xn}    (22)

for every x1 , . . . , xn ∈ {0, 1}.

Random variables having the property (22) are known as exchangeable. Thus X1 , . . . , Xn
are exchangeable random variables. Here are some consequences of exchangeability.

1. X1 , . . . , Xn have identical distributions i.e.,

P{X1 = x} = P{X2 = x} = · · · = P{Xn = x} (23)

for every x ∈ {0, 1}. The reason behind (23) is that, for every i = 2, . . . , n,

P{Xi = x} = Σ_{x_j : j ≠ i} P{X1 = x1 , Xi = x, Xj = xj for j ≠ 1, j ≠ i}
= Σ_{x_j : j ≠ i} P{X1 = x, Xi = x1 , Xj = xj for j ≠ 1, j ≠ i} = P{X1 = x}

Even though X1 , . . . , Xn have identical distributions, they are not independent, however, because

P{X2 = 1 | X1 = 1} = (R − 1)/(N − 1)    and    P{X2 = 1 | X1 = 0} = R/(N − 1).

In general,

i.i.d. =⇒ exchangeable, but exchangeable ⇏ i.i.d.

2. The distribution of every pair (Xi , Xj ) for i ̸= j is the same i.e.,

P{Xi = u, Xj = v} = P{X1 = u, X2 = v} (24)

for all u, v ∈ {0, 1}. This is true because, for example,

P{X3 = u, X4 = v} = Σ_{x1 ,x2 ,x5 ,...,xn} P{X1 = x1 , X2 = x2 , X3 = u, X4 = v, X5 = x5 , . . . , Xn = xn}
= Σ_{x1 ,x2 ,x5 ,...,xn} P{X1 = u, X2 = v, X3 = x1 , X4 = x2 , X5 = x5 , . . . , Xn = xn}
= P{X1 = u, X2 = v}.

This proves that (X3 , X4 ) has the same distribution as (X1 , X2 ). The proof for other
pairs is similar.

3. The distribution of every k-tuple (Xi1 , . . . , Xik ) for any distinct indices i1 , . . . , ik from
{1, . . . , n} is the same. The proof is similar to that of (23) and (24).

5.1.1 Mean and Variance of the Hypergeometric Distribution

The mean and variance of the random variable X having the Hypergeometric distribution
(20) are given by:

EX = nR/N    and    var(X) = nR(N − n)(N − R) / (N²(N − 1)).

These formulae can be easily derived using exchangeability of X1 , . . . , Xn and the fact that
X = X1 + · · · + Xn as follows:
EX = E(X1 + · · · + Xn) = nEX1 = nP{X1 = 1} = nR/N

and

EX² = E(X1 + · · · + Xn)²
= E[X1² + · · · + Xn² + Σ_{i≠j} Xi Xj]
= E[X1 + · · · + Xn + Σ_{i≠j} Xi Xj]    (using Xi² = Xi since each Xi is binary)
= nR/N + n(n − 1)E(X1 X2)
= nR/N + n(n − 1)P{X1 = 1, X2 = 1}
= nR/N + n(n − 1)P{X1 = 1}P{X2 = 1 | X1 = 1}
= nR/N + n(n − 1)(R/N)((R − 1)/(N − 1)).

From here, one can deduce that

var(X) = EX² − (EX)² = nR/N + n(n − 1)(R/N)((R − 1)/(N − 1)) − (nR/N)² = nR(N − n)(N − R)/(N²(N − 1)).
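As a quick sanity check, both formulae can be compared against scipy's built-in hypergeometric moments (a sketch, with illustrative numbers; not part of the original notes):

from scipy.stats import hypergeom

N, R, n = 50, 20, 10
mean, var = hypergeom.stats(N, R, n, moments='mv')
print(mean, n * R / N)                                      # 4.0  4.0
print(var, n * R * (N - n) * (N - R) / (N**2 * (N - 1)))    # 1.959...  1.959...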

For more calculations involving the Hypergeometric Distribution, see Jaynes [1, Chapter 3].

5.2 Inverse Problem

We now address the following question. Suppose we actually draw n balls from the urn and
see that r of them are red. What can we then infer about the contents of the original urn?
Specifically, using the observed data n and r, what can we say about R and N ?

This problem has applications in survey sampling. Suppose we want to know the number
of people R in a city that support the Republican party. We take a sample of n people, out of which r support the Republican party. What can we then say about R?

We shall study this in the following cases.

5.2.1 Case 1: N is known and R is unknown

We need to calculate

P{R = R̃ | data, N},

where data just refers to the observed values of n and r. Here R̃ denotes a specific integer
lying between 0 and N .

By the Bayes rule, we have


P{R = R̃ | data, N} ∝ P{data | R = R̃, N} P{R = R̃ | N}
= [\binom{R̃}{r}\binom{W̃}{w} / \binom{N}{n}] P{R = R̃ | N} ∝ \binom{R̃}{r}\binom{W̃}{w} P{R = R̃ | N}

To proceed further with the calculation, we need to make an assignment for P{R = R̃ | N }
which we take to be
P{R = R̃ | N} = (1/(N + 1)) I{R̃ ∈ {0, 1, . . . , N}}.

This means that we are not expressing any preference among the potential values 0, 1, . . . , N that R can take. This gives

P{R = R̃ | data, N} = \binom{R̃}{r}\binom{W̃}{w} I{R̃ ∈ {0, 1, . . . , N}} / C

where C is the constant

C := Σ_{R̃=0}^{N} \binom{R̃}{r}\binom{W̃}{w} = Σ_{R̃=0}^{N} \binom{R̃}{r}\binom{N − R̃}{n − r}.

A standard mathematical fact involving binomial coefficients (often referred to as the Chu-Vandermonde identity; see, for example, equation (9) in https://en.wikipedia.org/wiki/Binomial_coefficient#Sums_of_the_binomial_coefficients) now gives

C = Σ_{R̃=0}^{N} \binom{R̃}{r}\binom{N − R̃}{n − r} = \binom{N + 1}{n + 1}.

We thus get the following formula for the posterior distribution of R:

P{R = R̃ | data, N} = \binom{R̃}{r}\binom{W̃}{w} I{R̃ ∈ {0, 1, . . . , N}} / \binom{N + 1}{n + 1}.

Note how nicely this posterior distribution corresponds to common sense. For example, we automatically get 0 for the posterior when R̃ < r and when W̃ = N − R̃ < w. This is because \binom{R̃}{r} = 0 when R̃ < r and \binom{W̃}{w} = 0 when W̃ < w. Also when there is no data (i.e., when n = r = 0), the posterior becomes equal to the prior.

We can summarize the above posterior distribution via its mean and variance which are
given by (see Chapter 6 of the Jaynes book for details behind the answers below)

E(R | data, N) = (N + 2)(r + 1)/(n + 2) − 1

and

var(R | data, N) = (N + 2)(N − n) p(1 − p)/(n + 3)

where

p := (r + 1)/(n + 2).    (25)

Observe that when N (the number of balls in the urn) is large, we can write

E(R/N | data, N) ≈ (r + 1)/(n + 2) = p

and

var(R/N | data, N) ≈ p(1 − p)/(n + 3).

A point estimate of R/N can be taken to be p (when N is large) and the uncertainty of this point estimate can be taken to be the posterior standard deviation √(p(1 − p)/(n + 3)). These of course can be calculated from the observed data r and n.
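A small Python sketch (scipy assumed; the values of N, n, r are illustrative, not from the notes) that computes this posterior on a grid and checks the posterior-mean formula:

import numpy as np
from scipy.special import comb

N, n, r = 100, 20, 8
R_grid = np.arange(N + 1)
post = comb(R_grid, r) * comb(N - R_grid, n - r)   # zero when R_grid < r or N - R_grid < n - r
post = post / post.sum()                           # the unnormalized sum equals comb(N+1, n+1)

post_mean = (R_grid * post).sum()
print(post_mean, (N + 2) * (r + 1) / (n + 2) - 1)  # both ≈ 40.727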

Let us now do a predictive calculation. Let Rn+1 denote the proposition that the (n + 1)th
draw leads to a red ball. What is the probability of Rn+1 given the observed data? This is
obtained by
P{Rn+1 | data, N} = Σ_{R̃=0}^{N} P{Rn+1 | R = R̃, data, N} P{R = R̃ | data, N}.

Given R = R̃ and the data, it is clear that, just before the (n + 1)th draw, the urn contains R̃ − r red balls and N − n total balls. As a result,

P{Rn+1 | R = R̃, data, N} = (R̃ − r)/(N − n).
Thus

P{Rn+1 | data, N} = Σ_{R̃=0}^{N} [(R̃ − r)/(N − n)] P{R = R̃ | data, N}
= E[(R − r)/(N − n) | data, N]
= [E(R | data, N) − r]/(N − n)
= [(N + 2)(r + 1)/(n + 2) − 1 − r]/(N − n) = (r + 1)/(n + 2) = p
because of (25). This equation is known as the Laplace Rule of Succession (more details on
this will be provided later). Observe that when n (the sample size) is large, we have
p = (r + 1)/(n + 2) ≈ r/n
so that P {Rn+1 | data, N } is basically equal to the observed fraction of red balls in the
sample.

We can write the above formula in slightly different notation. Let X1 , . . . , Xn+1 be as
before i.e., Xi equals 1 if the ith draw leads to red and 0 if the ith draw leads to white. Then,
when n is large,
P{Xn+1 = 1 | X1 = x1 , . . . , Xn = xn} ≈ (x1 + · · · + xn)/n.
This reveals a connection between probability and observed frequency that emerges naturally from a probability calculation.

5.2.2 Case 2: N is unknown and R is known

This situation arises in the Capture-Recapture problem in ecology. Suppose there is a given
pond with some fish and we want to estimate the number of fish in the pond. We take a
first fish sample of size R. We then color red (or just tag by some label) all the fish in our
sample and then let them back into the pond. Now the pond is like an urn with R red fish

and the remaining W = N − R non-red fish. We now take a second sample of size n and
observe that r of this second sample of fish are red. Based on knowledge of r, n, R, what can
we infer about N ?

The relevant calculation now is


n o n o
P N = Ñ | data, R ∝ P data | N = Ñ , R P{N = Ñ | R}
R Ñ −R Ñ −R
  
r w w
= P{N = Ñ | R} ∝ P{N = Ñ | R}
Ñ Ñ
 
n n

To proceed further with the calculation, we need to make an assignment for P{N = Ñ | R}
which we take to be
P{N = Ñ | R} = (1/(Nmax − R)) I{R ≤ Ñ < Nmax}

for some large number Nmax. This gives

P{N = Ñ | data, R} ∝ [\binom{Ñ − R}{w} / \binom{Ñ}{n}] I{Ñ < Nmax}.

The presence of the binomial coefficient means that the posterior probability above is zero
unless Ñ ≥ R + w. This posterior will give us everything that we need to know about Ñ
after observing the data. If the posterior depends on Nmax , we need to be careful with the
results as it would mean that the data is not very informative for N . This would be the case
for example if r = 0.
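The following sketch (illustrative numbers; numpy/scipy assumed) computes this capture-recapture posterior on a grid; its mode lands near the intuitive estimate Rn/r:

import numpy as np
from scipy.special import comb

R, n, r, Nmax = 100, 50, 10, 5000     # tagged fish, second-sample size, recaptures, prior cutoff
w = n - r
N_grid = np.arange(R + w, Nmax)       # posterior is zero below R + w
post = comb(N_grid - R, w) / comb(N_grid, n)
post = post / post.sum()
print(N_grid[np.argmax(post)])        # 500, i.e. close to R * n / r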

5.2.3 Case 3: Both N and R are unknown

In this case, we need to place a prior for

P(N = Ñ , R = R̃).

If we assume that R | N is uniform on {0, . . . , N}, we would get

P(N = Ñ, R = R̃) = P(N = Ñ) I{0 ≤ R̃ ≤ Ñ}/(Ñ + 1).

The posterior then becomes

P{N = Ñ, R = R̃ | data} ∝ [\binom{R̃}{r}\binom{W̃}{w} / \binom{Ñ}{n}] P{N = Ñ} I{0 ≤ R̃ ≤ Ñ}/(Ñ + 1).

In this case, we can do useful inference on R/N but we cannot learn anything nontrivial
from the data about N . To see this, calculate the marginal posterior probability of N and
show that

P{N = Ñ | data} ∝ P(N = Ñ) I{Ñ ≥ n}.

This means that the posterior for N is just the prior truncated to the set {n, n + 1, . . . }, so we basically don't learn anything about N from the sample other than the fact that it is at least n. Nontrivial inference about N is only possible if R and N are known to be linked in some manner and we use a prior reflecting that link.

To learn more about inference of the parameters of the hypergeometric distribution, see
Jaynes [1, Chapter 6].

6 Lecture Six

6.1 Random Variables

We shall go over the Binomial and Negative Binomial distributions today. It will be con-
venient to use the language of random variables. The term “random variable” can be used
to describe any varying quantity (taking real values) about which we are uncertain.
Many real-life quantities such as (a) The average temperature in Berkeley tomorrow, (b) The
height of the tallest student in this room, (c) the number of phone calls that I will receive
tomorrow, (d) the number of accidents that will occur on Hearst avenue in September, etc.
can be treated as random variables. The term “random” in the phrase “random variable”
refers to the uncertainty of a specific individual about the specific value that will be taken
by the variable. Note, in particular, that the variable may not be intrinsically random but it
is random from the point of view of a specific individual because of their uncertainty. For
example, it is perfectly fine for me to treat Joe Biden’s current height as a random variable
even though it is actually non-random.

The distribution of a random variable is, informally, a description of the set of values that
the random variable takes and the probabilities with which it takes those values.

If a random variable X takes a finite or countably infinite set of possible values (in this
case, we say that X is a discrete random variable), its distribution is described by a listing
of the values a1 , a2 , . . . that it takes together with a specification of the probabilities:
P{X = ai } for i = 1, 2, . . . .
The function which maps ai to P{X = ai } is called the probability mass function (pmf) of
the discrete random variable X.

If a random variable takes a continuous set of values, its distribution is often described by
a function called the probability density function (pdf). We shall formally define this later.

6.1.1 Independence of Random Variables

We say that random variables X1 , . . . , Xn are independent if, for every subset S ⊆ {1, . . . , n}, conditioning on any proposition involving Xi , i ∉ S does not change the probability of any proposition involving Xi , i ∈ S. From here one can easily derive properties of independence such as

P{X1 ∈ A1 , . . . , Xn ∈ An} = P{X1 ∈ A1}P{X2 ∈ A2} . . . P{Xn ∈ An}

for all possible choices of A1 , . . . , An.

Independence can be a subtle concept in modeling. Suppose I am uncertain about X1 :=


Joe Biden’s height and X2 := Donald Trump’s height. Would it be reasonable for me to
assume that X1 and X2 are independent?

6.2 Common Discrete Distributions

6.2.1 Bernoulli Ber(p) Distribution

A random variable X is said to have the Ber(p) (Bernoulli with parameter p) distribution
if it takes the two values 0 and 1 with P{X = 1} = p.

Note then that EX = p and Var(X) = p(1 − p). For what value of p is X most variable? Least variable?

6.2.2 Binomial Bin(n, p) Distribution

A random variable X is said to have the Binomial distribution with parameters n and p (n
is a positive integer and p ∈ [0, 1]) if it takes the values 0, 1, . . . , n with pmf given by
 
P{X = k} = \binom{n}{k} p^k (1 − p)^{n−k}    for every k = 0, 1, . . . , n.

Here \binom{n}{k} is the binomial coefficient:

\binom{n}{k} := n! / (k!(n − k)!).

The binomial distribution arises in basically two contexts:

1. Approximation of the hypergeometric distribution: In the last class, we looked at the hypergeometric distribution which corresponds to the probabilities

\binom{R}{r}\binom{W}{w} / \binom{N}{n} = \binom{n}{r} [R(R − 1) · · · (R − r + 1) W(W − 1) · · · (W − w + 1)] / [N(N − 1) · · · (N − n + 1)]
= \binom{n}{r} · [R(R − 1) · · · (R − r + 1)] / [N(N − 1) · · · (N − r + 1)] · [W(W − 1) · · · (W − w + 1)] / [(N − r)(N − r − 1) · · · (N − n + 1)].

This arises as the probability of seeing exactly r red balls when n (= r + w) balls are drawn without replacement from an urn with R red balls and N − R white balls. When N is much larger than n, we can write

(R − i)/(N − i) = (R/N − i/N)/(1 − i/N) ≈ R/N

and

(W − j)/(N − r − j) = (W/N − j/N)/(1 − (r + j)/N) ≈ W/N.

As a result, the hypergeometric probability simplifies to

\binom{n}{r} (R/N)^r (W/N)^w

which corresponds to the binomial distribution with parameters n and p := R/N.
which corresponds to the binomial distribution with parameters n and p := R/N .

2. Number of Successes in Repeated Trials: Suppose

X = X1 + · · · + Xn

where each Xi ∼ Ber(p) and X1 , . . . , Xn are independent. Then it can be checked that
X ∼ Bin(n, p).

Example 6.1 (Fairness testing). Suppose a coin is tossed 12 times leading to the outcome:
TTTTHTHTTTTH (this has 3 heads and 9 tails). What is your assessment of the fairness
of the coin?

For the usual frequentist answer to this question, we assume that the observed sequence of outcomes is the realization of random variables X1 , . . . , Xn (with n = 12) that are independently distributed according to the Ber(p) distribution for some unknown p. We need
to test the (null) hypothesis that p = 0.5 against, say, the alternative p < 0.5. This can be
done by calculating the p-value which is the probability (under the assumption p = 0.5) of
getting 3 or lower heads. The distribution of the number of heads under the null distribution
is Bin(n, 0.5) so the p-value is
       
[\binom{12}{3} + \binom{12}{2} + \binom{12}{1} + \binom{12}{0}] (1/2)^{12} = 299/4096 = 0.073 = 7.3%

which does not lead to a rejection of the null hypothesis at the usual 5% level.
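This p-value is quickly reproduced on the computer (a one-line check with scipy; not part of the original notes):

from scipy.stats import binom
print(binom.cdf(3, 12, 0.5))   # P{Bin(12, 0.5) <= 3} = 299/4096 ≈ 0.073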

6.2.3 Negative Binomial NB(n, p) distribution

Let X denote the number of tosses (of a coin with probability of heads p) required to get
the k th head. What is the probability distribution of X?

The distribution of X is given by the following. X takes the values k, k + 1, . . . and


 
P{X = k + i} = \binom{k + i − 1}{i} p^k (1 − p)^i
= [(k + i − 1)(k + i − 2) · · · (k + 1)k / i!] p^k (1 − p)^i
= (−1)^i [(−k)(−k − 1)(−k − 2) · · · (−k − i + 1) / i!] p^k (1 − p)^i
= (−1)^i \binom{−k}{i} p^k (1 − p)^i    for i = 0, 1, 2, . . . .

This is called the Negative Binomial distribution with parameters k and p (denoted by
N B(k, p)).

Example 6.2 (Fairness Testing (continued)). Let us get back to the fairness testing problem
in Example 6.1 where a coin was tossed 12 times leading to the outcome: TTTTHTHTTTTH
(this has 3 heads and 9 tails). In our previous p-value calculation, we implicitly assumed
that the experiment consisted of tossing the coin 12 times where 12 was a priori chosen by
the coin tosser. Consider now the alternative scenario where the coin tosser wanted to toss
the coin until the point where 3 heads are observed. Now for the same outcome, the p-value
will change. Indeed now the random variable of interest will become N = number of tosses
and the p-value will equal the probability of needing to toss the coin 12 or more times to get
the 3 heads (assuming fairness). This is calculated using the negative binomial distribution
as:
1 − Σ_{n=3}^{11} \binom{n − 1}{2} 2^{−n} = 134/4096 = 0.0327 = 3.27%

and this leads to rejection of the null hypothesis at the 5% level.
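This too can be checked with scipy. Note that scipy's nbinom counts the tails (failures) before the 3rd head, so needing 12 or more tosses means 9 or more tails (a sketch, not part of the original notes):

from scipy.stats import nbinom
print(nbinom.sf(8, 3, 0.5))    # P{tails >= 9} = 134/4096 ≈ 0.0327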

Note that the “likelihood function” is the same function p^3 (1 − p)^9 whether the sample size
was predetermined or whether the coin was tossed till 3 heads are observed. But the proce-
dure obtained for testing p = 0.5 has changed from the binomial to the negative binomial case.
This means that p-valued based frequentist inference violates the Likelihood Principle (the
likelihood principle states that “all the evidence in a sample relevant to model parameters is

contained in the likelihood function”). Here is a story from the wikipedia article on the “Like-
lihood Principle” (see https://en.wikipedia.org/wiki/Likelihood_principle) which
puts an interesting context to these numbers:

Suppose a number of scientists are assessing the probability of a certain outcome (which
we shall call ’success’) in experimental trials. Conventional wisdom suggests that if there is
no bias towards success or failure then the success probability would be one half. Adam, a
scientist, conducted 12 trials and obtains 3 successes and 9 failures. One of those successes
was the 12th and last observation. Then Adam left the lab.

Bill, Adam’s boss in the same lab, continued Adam’s work and published Adam’s results,
along with a significance test. He tested the null hypothesis that θ, the success probability, is
equal to a half, versus θ < 0.5 . The probability that out of 12 trials, 3 or fewer (i.e. more
extreme) were successes, if H0 is true, is 7.3%. Thus the null hypothesis is not rejected at
the 5% significance level.

Adam actually stopped immediately after 3 successes, because his boss Bill had instructed
him to do so. After the publication of the statistical analysis by Bill, Adam realizes that
he has missed a later instruction from Bill to instead conduct 12 trials, and that Bill’s
paper is based on this second instruction. Adam is very glad that he got his 3 successes
after exactly 12 trials, and explains to his friend Charlotte that by coincidence he executed
the second instruction. But Charlotte then explains to Adam that the p-value should now be
changed to 3.27% and the result becomes significant at the 5% level. Adam is astonished to
hear this.

For more comments on the violation of the likelihood principle by p-values, read MacKay
[5, Section 37.2].

To contrast with the above p-value based analysis, let us look at a Bayesian/probability
theory approach to this testing problem. The goal is to calculate:
P{fairness | data}
where data refers to T T T T HT HT T T T H. By the Bayes rule, we can write

P{fairness | data} = P{data | fairness}P{fairness} / [P{data | fairness}P{fairness} + P{data | not fair}P{not fair}].

We clearly have

P{data | fairness} = 2^{−n}.
What assignments do we use for
P{fairness}, P{not fair} and P{data | not fair}?
For concreteness, let us assume
P{fairness} = 0.5 and P{not fair} = 0.5. (26)
This is actually a very strong assumption in favor of fairness because a coin can fail to be fair in many different ways. To assume that the probability of fairness equals the combined probability of all the ways in which the coin can be unfair is quite a strong statement.

Let us now come to P{data | not fair}. If the coin is not fair, we can assume that it has a
heads probability of p and that the coin tosses are still independent. We can then write
P{data | not fair} = ∫_0^1 P{data | not fair, p} f_{p|not fair}(p) dp = ∫_0^1 p^3 (1 − p)^9 f_{p|not fair}(p) dp.

To proceed further, we need to assign fp|not fair (p). One concrete assumption might be that
fp|not fair (p) = 1 for every p ∈ [0, 1]. (27)
This corresponds to the assumption that, under the alternative (not fair), p has the uniform
distribution on [0, 1]. Then (using an online integrator)
P{data | not fair} = ∫_0^1 p^3 (1 − p)^9 dp = 1/2860.

We then get

P{fairness | data} = (2^{−12} × 0.5) / (2^{−12} × 0.5 + (1/2860) × 0.5) = 0.4111558.
Note that this Bayesian probability calculation does not depend at all on whether the number
of tosses (n = 12) was decided a priori or whether it was decided to toss until getting 3 heads.
It is the same for both those cases.
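A short sketch reproducing this posterior probability (scipy assumed; not part of the original notes):

from scipy.special import beta

n, heads, tails = 12, 3, 9
p_data_fair = 0.5 ** n
p_data_unfair = beta(heads + 1, tails + 1)    # ∫ p^3 (1-p)^9 dp = 1/2860
print(p_data_fair * 0.5 / (p_data_fair * 0.5 + p_data_unfair * 0.5))   # ≈ 0.4111558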

Also note that the Bayesian approach (based on (26) and (27)) only slightly supports the alternative hypothesis (roughly 60% versus 40% for the null) while the frequentist p-values are
fairly small indicating more evidence for the alternative. This discrepancy also persists when
the sample size is large. Consider the following example.

Example 6.3. In a certain city, 49581 boys and 48870 girls have been born over a certain
time period (note 49581/(49581 + 48870) = 0.5036109). Assuming that the number of male
births is binomially distributed with parameters n = 49581 + 48870 = 98451 and p, test the
hypothesis H0 : p = 0.5.

The usual frequentist p-value is:


P {Bin(n, 0.5) ≥ 49581} ≈ 0.01163
which is fairly small.

On the other hand, the Bayesian method above with the priors (26) and (27) gives

P{p = 0.5 | data} = (2^{−n} × 0.5) / (2^{−n} × 0.5 + B(x + 1, n − x + 1) × 0.5) = 0.950523.

Here x = 49581, n = 98451 and B(α, β) = ∫_0^1 p^{α−1}(1 − p)^{β−1} dp is the Beta function.

Thus the Bayesian method gives a high probability to the null while the frequentist method
will reject the null hypothesis. The reason why the Bayesian method is so supportive of the
null hypothesis is that the prior choice (26) gives strong support to p = 0.5 over nearby values
of p (such as p ∈ (0.49, 0.51)).

This discrepancy between the Bayesian and Frequentist solutions in this problem is referred
to as the Jeffreys-Lindley paradox (see https: // en. wikipedia. org/ wiki/ Lindley% 27s_
paradox ).

7 Lecture Seven

7.1 Geometric Distribution

The Geometric distribution is a special case of the Negative Binomial distribution for k = 1.
It corresponds to the number of independent tosses (of a coin with probability of heads

p) required to get the first head. Formally, we say that X has the Geometric distribution
with parameter p ∈ [0, 1] (written as X ∼ Geo(p)) if X takes the values 1, 2, . . . with the
probabilities:
P{X = i} = (1 − p)^{i−1} p    for i = 1, 2, . . . .

The Geo(p) distribution has the interesting property of memorylessness i.e., if X ∼ Geo(p),
then
P {X > m + n|X > n} = P {X > m} . (28)
This is easy to check as P{X > m} = (1 − p)^m. It is also interesting that the Geometric
distribution is the only distribution on {1, 2, . . . } which satisfies the memorylessness property
(28). To see this, suppose that X is a random variable satisfying (28) which takes values in
{1, 2, . . . }. Let G(m) := P{X > m} for m = 1, 2, . . . . Then (28) is the same as

G(m + n) = G(m)G(n).

This clearly gives G(m) = (G(1))^m for each m = 1, 2, . . . . Now G(1) = P{X > 1} = 1 − P{X = 1}. If p = P{X = 1}, then

G(m) = (1 − p)^m

which means that P{X = i} = P{X > i − 1} − P{X > i} = p(1 − p)^{i−1} for every i ≥ 1, meaning that X is Geo(p).

7.2 Poisson Distribution

A random variable X is said to have the Poisson distribution with parameter λ > 0 (denoted
by P oi(λ)) if X takes the values 0, 1, 2, . . . with pmf given by

P{X = k} = e^{−λ} λ^k / k!    for k = 0, 1, 2, . . . .
The main utility of the Poisson distribution comes from the following fact:

Fact: The binomial distribution Bin(n, p) is well-approximated by the Poisson distribution Poi(np) provided that the quantity np² is small.

To intuitively see why this is true, just see that

P{Bin(n, p) = 0} = (1 − p)^n = exp(n log(1 − p)).

Note now that np² being small implies that p is small (note that p can be written as √(np²/n) ≤ √(np²), so small np² will necessarily mean that p is small). When p is small, we can approximate log(1 − p) as −p − p²/2 so we get

P{Bin(n, p) = 0} = exp(n log(1 − p)) ≈ exp(−np) exp(−np²/2).




Now because np² is small, we can ignore the second term above to obtain that P{Bin(n, p) = 0} is approximated by exp(−np), which is precisely equal to P{Poi(np) = 0}. One can similarly approximate P{Bin(n, p) = k} by P{Poi(np) = k} for every fixed k = 0, 1, 2, . . . .

There is a formal theorem (known as Le Cam’s theorem) which rigorously proves that
Bin(n, p) ≈ Poi(np) when np² is small. This is stated without proof below (its proof is
beyond the scope of this class).

Theorem 7.1 (Le Cam’s Theorem). Suppose X1 , . . . , Xn are independent random variables
such that Xi ∼ Ber(pi ) for some pi ∈ [0, 1] for i = 1, . . . , n. Let X = X1 + · · · + Xn and
λ = p1 + · · · + pn. Then

Σ_{k=0}^{∞} |P{X = k} − P{Poi(λ) = k}| < 2 Σ_{i=1}^{n} pi².

In the special case when p1 = · · · = pn = p, the above theorem says that

Σ_{k=0}^{∞} |P{Bin(n, p) = k} − P{Poi(np) = k}| < 2np²

and thus when np² is small, the probability P{Bin(n, p) = k} is close to P{Poi(np) = k} for each k = 0, 1, . . . .

An implication of this fact is that for every fixed λ > 0, we have

Poi(λ) ≈ Bin(n, λ/n)    when n is large.

This is because when p = λ/n, we have np² = λ²/n, which will be small when n is large.
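Here is a quick numerical illustration of this (a sketch assuming scipy; the truncation of k at 30 covers essentially all the mass for λ = 3):

import numpy as np
from scipy.stats import binom, poisson

lam, k = 3.0, np.arange(30)
for n in [10, 100, 1000]:
    dist = np.abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam)).sum()
    print(n, dist)   # shrinks roughly like 2 * lam**2 / n, as in Le Cam's bound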

This approximation property of the Poisson distribution is the reason why the Poisson
distribution is used to model counts of rare events. For example, it is common to use the
Poisson distribution to model the number of phone calls a telephone operator receives in a
day, the number of accidents in a particular street in a day, the number of typos found in a
book, the number of goals scored in a football game etc. Can you justify why the Poisson
distribution might be appropriate for these random variables?

7.3 Continuous Random Variables

Continuous random variables are random variables that potentially take a continuous set of
values. The distribution of a continuous random variable X is often described by a function
called the probability density function (pdf). The pdf is a function f on R that satisfies
f(x) ≥ 0 for every x ∈ R and

∫_{−∞}^{∞} f(x) dx = 1.
The pdf f of X can be used to calculate P{X ∈ A} for every set A via
P{X ∈ A} = ∫_A f(x) dx.

Note that if X has pdf f , then for every y ∈ R,


P{X = y} = ∫_y^y f(x) dx = 0.

It is important to remember that the pdf f (x) of a random variable does not represent
probability (in particular, it is quite common for f (x) to take values much larger than one).
Instead, the value f (x) can be thought of as a constant of proportionality for probabilities.
This is because usually (as long as f is continuous at x):
lim_{δ↓0} (1/δ) P{x ≤ X ≤ x + δ} = f(x).

If X is a continuous random variable with density (pdf) f , the expectation of g(X) is defined
as

Eg(X) = ∫_{−∞}^{∞} g(x) f(x) dx.    (29)
We shall next look at some standard Continuous Distributions.

7.4 Uniform Distribution

A random variable U is said to have the uniform distribution on (0, 1) if it has the following
pdf:

f(x) = 1 for 0 < x < 1, and f(x) = 0 for all other x.
We write U ∼ U [0, 1].

More generally, given an interval (a, b), we say that a random variable U has the uniform
distribution on (a, b) if it has the following pdf:
f(x) = 1/(b − a) for a < x < b, and f(x) = 0 for all other x.

We write this as U ∼ U (a, b).

7.5 The Gaussian or Normal Distribution

A random variable X has the Gaussian or normal distribution with mean µ and variance
σ 2 > 0 if it has the following pdf:
ϕ(x; µ, σ²) := (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).

We write X ∼ N(µ, σ²). When µ = 0 and σ² = 1, we say that X has the standard normal distribution and the standard normal pdf is simply denoted by ϕ(·):

ϕ(x) = (1/√(2π)) exp(−x²/2).

The following is the reason for the presence of the factor √(2π) above:

∫_{−∞}^{∞} exp(−x²/2) dx = √(2π).    (30)

To see why (30) is true, note that

(∫_{−∞}^{∞} exp(−x²/2) dx)² = ∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(−(x² + y²)/2) dx dy = ∫_0^∞ exp(−r²/2) (2πr) dr = 2π ∫_0^∞ e^{−z} dz = 2π.

From the above (and by a change of variable), we can derive

∫_{−∞}^{∞} exp(−(x − µ)²/(2σ²)) dx = √(2πσ²).

If X ∼ N(µ, σ²), then E(X) = µ and Var(X) = σ².

7.5.1 The Gauss Derivation of the Normal Distribution

We shall next look at the Gauss derivation of the normal distribution. Gauss applied the
normal distribution in the context of data analysis. The basic question that Gauss addressed
is the following: Suppose we take measurements x1 , . . . , xn on some physical quantity θ.
When is x̄ := (x1 + · · · + xn )/n the right estimate for θ? Before the work of Gauss, some
prominent mathematicians had doubts about the use of x̄ to estimate θ. Jaynes (see Jaynes
[1, Section 7.4]) writes that Euler thought that combining many observations would make
their errors multiply instead of canceling. As is clear from the quote below (taken from Jaynes
[1, Page 204]), Daniel Bernoulli thought that taking the average of observations amounts to
assuming that the individuals errors in the observations are uniformly distributed and that
assuming that the errors are uniformly distributed contradicts common sense:

Now is it not self-evident that the hits must be assumed to be thicker and more numerous
on any given band the nearer this is to the mark? If all the places on the vertical plane,
whatever their distance from the mark, were equally liable to be hit, the most skillful shot
would have no advantage over a blind man. That, however, is the tacit assertion of those
who use the common rule (the arithmetic mean) in estimating the value of various discrepant
observations, when they treat them all indiscriminately.

The quote above (by Daniel Bernoulli) is in the context of an archer shooting at a vertical
line drawn on a target and contemplating the number of shots landing on vertical bands
on either side of the vertical line.

Gauss showed that taking the average of the observations is the right way of estimating θ when the errors have the Gaussian distribution. Gauss first assumed that the errors have a distribution f. More specifically, assume that

xi = θ + ϵi

for i = 1, . . . , n with

ϵ1 , . . . , ϵn ∼ i.i.d. f
for a density f . The maximum likelihood estimator of θ is then given by the maximizer θ̂ of
Σ_{i=1}^{n} log f(xi − θ).

Letting g(u) = log f(u), we can say that θ̂ maximizes

Σ_{i=1}^{n} g(xi − θ)

which means (assuming g is smooth)

Σ_{i=1}^{n} g′(xi − θ̂) = 0.

Gauss asked for what density f is it true that the maximum likelihood estimator θ̂ equals
the mean x̄ for every dataset x1 , . . . , xn . More precisely, for what f (or equivalently g) do
we have
Σ_{i=1}^{n} g′(xi − x̄) = 0    for every n ≥ 1 and x1 , . . . , xn .

Gauss showed that this equation leads to g ′ being the linear function:

g ′ (u) = au (31)

for some a ∈ R. This means g(u) = au²/2 + b so that

f(u) = exp(au²/2 + b).

For f to be a density over (−∞, ∞), we need a < 0 in which case f will be normal:

f(u) = (1/√(2πσ²)) exp(−u²/(2σ²))

for some σ > 0. The density of the observations x1 , . . . , xn is then

(1/√(2πσ²)) exp(−(x − θ)²/(2σ²)).

Gauss therefore showed that the only distribution on the errors which leads to maximum
likelihood estimates being averages is the Normal. This accounts for the popularity (since
Gauss) of normal error assumptions in data analysis.

Here is the argument for (31). If we take n = 2, we get

g ′ ((x1 − x2 )/2) = −g ′ ((x2 − x1 )/2) for every x1 and x2

which means g ′ (0) = 0 and g ′ (−x) = −g ′ (x). Now taking

x1 = nu and x2 = · · · = xn = 0,

for some u (so that x̄ = u), we get

g ′ ((n − 1)u) + (n − 1)g ′ (−u) = 0

which gives (combining with g ′ (−x) = −g ′ (x))

g ′ ((n − 1)u) = (n − 1)g ′ (u).

Taking m = n − 1, we have proved that

g ′ (mu) = mg ′ (u) for every u ∈ R and m ≥ 0.

This can also be written as (replacing u by u/m) g ′ (u/m) = g ′ (u)/m and thus we have
g′((n/m)u) = (n/m) g′(u)

for every n, m ≥ 1, which is the same as g′(ru) = rg′(u) for every positive rational r and real
u. If we now assume that g ′ is continuous, we obtain g ′ (uv) = vg ′ (u) for every v > 0
and u which gives g ′ (u) = ug ′ (1). This proves (31) with a = g ′ (1). Something can still
be said without continuity (see the wiki article on the Cauchy Functional Equation https:
//en.wikipedia.org/wiki/Cauchy%27s_functional_equation).

For more comments on Gauss’s derivation of the normal distribution, see Jaynes [1, Section
7.4].

8 Lecture Eight

8.1 The Normal Distribution as an Approximation to the Binomial Distribution

The normal distribution shows up as an approximation to the Binomial Distribution and,


in fact, this was the way the normal density was first discovered by De Moivre. We shall go
over this today. Fix n ≥ 1 and p ∈ (0, 1). Suppose X ∼ Bin(n, p). Then
 
P{X = k} = \binom{n}{k} p^k (1 − p)^{n−k} = [n!/(k!(n − k)!)] p^k (1 − p)^{n−k}    for k = 0, 1, . . . , n    (32)

In what sense do the above probabilities resemble the normal density? To understand this,
the first step is to approximate the factorials using the Stirling Approximation. A brief
overview of Stirling’s approximation is discussed next.

8.1.1 Stirling Approximation

The Stirling Approximation is an approximation for n! that is quite accurate even for small
values of n. It states that
n! ∼ (n/e)^n √(2πn).

The Stirling Approximation is quite accurate even for small n (even for n = 1) as can be checked by evaluating the right hand side and comparing it to n!. The accuracy of the approximation can be gauged from the bound:

exp(1/(12n + 1)) < n! / [(n/e)^n √(2πn)] < exp(1/(12n)).
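The accuracy claim is easy to check directly; a minimal Python sketch (not part of the original notes):

import math

for n in [1, 2, 5, 10]:
    stirling = (n / math.e) ** n * math.sqrt(2 * math.pi * n)
    # the ratio n!/stirling is already within 9% at n = 1 and within 1% at n = 10
    print(n, math.factorial(n), stirling, math.factorial(n) / stirling)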

Here is a heuristic justification for the Stirling Approximation using the Laplace Method for
approximating integrals. We start with the basic formula
Z ∞
n! = xn e−x dx
0

which can be proved by induction over n. Rewrite the integral as


Z ∞
n! = exp (n log x − x) dx.
0

The change of variable x = ny gives


n! = n^n · n ∫_0^∞ exp(n(log y − y)) dy.

The function y ↦ log y − y attains its maximum value of −1 at y = 1. Because of the presence of n in the exponent, the integral will be dominated by the points y which are close to 1 (at least for large n). We shall therefore use the second order Taylor expansion:

g(y) := log y − y ≈ g(1) + g′(1)(y − 1) + (1/2)g″(1)(y − 1)² = −1 − (1/2)(y − 1)²

in the exponent to get

n! ≈ n^n · n ∫_0^∞ exp(−n − (n/2)(y − 1)²) dy
= n (n/e)^n ∫_0^∞ exp(−(n/2)(y − 1)²) dy
≈ n (n/e)^n ∫_{−∞}^{∞} exp(−(n/2)(y − 1)²) dy = n (n/e)^n √(2π/n) = (n/e)^n √(2πn)

which is exactly the Stirling Approximation.

8.1.2 Entropy Approximation of Bin(n, p)

Let us now get back to the problem of approximating the binomial probabilities (32). A
reference for these calculations is Sinai [2, Chapter 3] (available for free through the Berkeley
Library website). Using the Stirling approximation
m! ∼ (m/e)^m √(2πm)
for each of the factorials in (32), we get

P{X = k} ≈ [(n/e)^n √(2πn)] / [(k/e)^k √(2πk) ((n − k)/e)^{n−k} √(2π(n − k))] · p^k (1 − p)^{n−k}
= [1/√(2πn (k/n)((n − k)/n))] (k/n)^{−k} ((n − k)/n)^{−(n−k)} p^k (1 − p)^{n−k}
= [1/√(2πnf(1 − f))] exp(−n[f log(f/p) + (1 − f) log((1 − f)/(1 − p))])    where f := k/n

Note that f = k/n denotes the fraction of heads whose probability we are calculating. Using
the notation
D(f∥p) := f log(f/p) + (1 − f) log((1 − f)/(1 − p)),

we have

P{Bin(n, p) = k} ≈ exp[−nD(f∥p)] / √(2πnf(1 − f)).    (33)
The quantity D(f ∥p) is known variously as either the Relative Entropy of (f, 1 − f ) with
respect to (p, 1−p) or as the Kullback-Leibler divergence between (f, 1−f ) and (p, 1−p).
We shall refer to (33) as the Entropy Approximation to Bin(n, p). The relative entropy
D(f ∥p) has the following two important properties:

1. Nonnegativity: D(f ∥p) is always nonnegative. This is basically a consequence of the


elementary inequality log x ≤ x − 1 because

D(f∥p) = f log(f/p) + (1 − f) log((1 − f)/(1 − p))
= −f log(p/f) − (1 − f) log((1 − p)/(1 − f)) ≥ −f(p/f − 1) − (1 − f)((1 − p)/(1 − f) − 1) = 0.

2. Zero if and only if f = p: This basically follows from the above argument and the
fact that log x = x − 1 if and only if x = 1.

The quantity D(f ∥p) is often seen as a measure of discrepancy or distance or divergence
between the two discrete probability distributions (f, 1 − f ) and (p, 1 − p).

The entropy approximation (33) can be rewritten in the following way. The proportion
f represents the empirical proportion of heads while p represents the theoretical (or true)
proportion of heads. Thus
P{Empirical proportion of heads and tails is (f, 1 − f)} ∼ exp[−nD(f∥p)] / √(2πnf(1 − f))

which shows clearly how the probability decays the further (f, 1 − f ) moves from (p, 1 − p)
as measured by the Kullback-Leibler divergence. The subject “Large Deviations Theory” in
Probability extends such probability facts to more complicated scenarios.

8.1.3 Normal Approximation of Bin(n, p)

To obtain the normal approximation for the Binomial, we approximate

G(f) := D(f∥p) = f log(f/p) + (1 − f) log((1 − f)/(1 − p))

by its Taylor expansion around f = p:

G(f) = G(p) + G′(p)(f − p) + (1/2)G″(p)(f − p)² + (1/6)G‴(g)(f − p)³

for some g between f and p. One can directly verify (by calculating derivatives of G) that

G(p) = 0,    G′(p) = 0,    G″(p) = 1/(p(1 − p)),    G‴(g) = (2g − 1)/(g²(1 − g)²).

As a result,

G(f) = (f − p)²/(2p(1 − p)) + [(2g − 1)/(6g²(1 − g)²)](f − p)³.
Plugging this into the formula for P{Bin(n, p) = k}, we obtain

P{Bin(n, p) = k} ∼ [1/√(2πnf(1 − f))] exp(−n(f − p)²/(2p(1 − p))) exp(−n(2g − 1)(f − p)³/(6g²(1 − g)²)).

If the third term above is close to one, then we can drop it, which will lead to the normal approximation for P{Bin(n, p) = k}. In order to do so, we need

Remainder := n(2g − 1)(f − p)³ / (6g²(1 − g)²)

to be small. Because |2g − 1| ≤ 1 (as g lies between 0 and 1), g ≥ min(f, p) and 1 − g ≥ min(1 − p, 1 − f), we can write

|Remainder| ≤ n|f − p|³ / [6(min(f, p))²(min(1 − f, 1 − p))²].

If p is away from 0 and 1 and n|f − p|³ is small, then the above quantity will be small. In this situation, we can ignore the remainder term to obtain the approximation:

P{Bin(n, p) = k} ∼ [1/√(2πnf(1 − f))] exp(−n(f − p)²/(2p(1 − p))).

Also, in the case when p is away from 0 and 1 and when n|f − p|³ is small, f is close to p, so we can replace f by p in the multiplicative term to obtain

P{Bin(n, p) = k} ∼ [1/√(2πnp(1 − p))] exp(−n(f − p)²/(2p(1 − p)))
= [1/√(2πnp(1 − p))] exp(−(k − np)²/(2np(1 − p))).
The above is the normal density with mean np and variance np(1 − p) evaluated at k. We
thus have
P{Bin(n, p) = k} ≈ ϕ(k; np, np(1 − p)) (34)
where ϕ(x; µ, σ 2 ) denotes the normal density with mean µ and variance σ 2 evaluated at x.

(34) is the normal approximation for P{Bin(n, p) = k}. Note that this requires p to be away from 0 and 1 and n|(k/n) − p|³ to be small. If these conditions are violated, the normal
approximation will not be accurate. The Entropy Approximation, on the other hand, is
accurate for a much larger range of k (it is accurate as long as the Stirling Approximation
is accurate for n − k and k and the Stirling approximation is quite accurate even for small
integers).

For a concrete example, consider the following two situations:

1. Suppose n = 3000, k = 2500, p = 0.5. Here f = k/n = 5/6, which is quite far from p. For example, n|f − p|³ is quite large. One can then verify on the computer that

   P{Bin(n, p) = k} ≈ 1.7 × 10^{−318} and ϕ(k; np, np(1 − p)) ≈ 4.3 × 10^{−292}.

   Thus the normal approximation is off by many orders of magnitude. On the other hand, the entropy approximation gives

   exp[−nD(f∥p)] / √(2πnf(1 − f)) ≈ 1.265 × 10^{−318}

   which is quite close to P{Bin(n, p) = k}.

2. Suppose n = 3000, k = 1525, p = 0.5. Here f = 0.5083, which is quite close to p. Also n|f − p|³ = 0.00173 is quite small. Then

   P{Bin(n, p) = k} ≈ 0.0096037 and ϕ(k; np, np(1 − p)) ≈ 0.0096033

   so the normal approximation is very accurate. The entropy approximation here is

   exp[−nD(f∥p)] / √(2πnf(1 − f)) ≈ 0.0096032.

In both these situations, the entropy approximation is accurate while the normal approximation works well only in the second situation.
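These numbers can be reproduced with the following sketch (scipy assumed; binom.logpmf avoids underflow in the first case, where the probability is around 10^{−318}):

import numpy as np
from scipy.stats import binom, norm

def entropy_approx(n, k, p):
    f = k / n
    D = f * np.log(f / p) + (1 - f) * np.log((1 - f) / (1 - p))
    return np.exp(-n * D) / np.sqrt(2 * np.pi * n * f * (1 - f))

for n, k, p in [(3000, 2500, 0.5), (3000, 1525, 0.5)]:
    exact = np.exp(binom.logpmf(k, n, p))                  # exact pmf (up to float limits)
    normal = norm.pdf(k, n * p, np.sqrt(n * p * (1 - p)))  # normal approximation
    print(exact, normal, entropy_approx(n, k, p))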

8.1.4 Implication for the chi-squared test

The Normal Approximation to the Binomial is the key ingredient in the popular Chi-Squared
test for goodness of fit. Because the normal approximation is not always valid, the chi-
squared test comes with certain warnings recommending against its use in some exceptional
cases (such as situations in which some of the cells have low counts). Use of the chi-squared
test in such situations leads to paradoxical conclusions. This is very nicely illustrated in the
following simple example (taken from Jaynes [1, Section 9.12]).

Example 8.1. Suppose that a coin toss can give three different results: H (heads), T (tails)
and edge (when the coin just stands on its edge). Suppose that a person A assigns proba-
bilities 0.499, 0.499, 0.002 to these three outcomes and another person B assigns probabilities
1/3, 1/3, 1/3 to these outcomes. Suppose now that we perform an experiment with this coin
by tossing it n = 29 times and this led to 14 heads, 14 tails and 1 edge.

We are now interested in measuring the fit between each of the two models (of person A
and B) and the observed data. If we use the chi-squared criterion:
Σ_{all outcomes} (observed count − expected count)² / (expected count)

for measuring goodness of fit, we would obtain


χ²_A = (14 − 29 × 0.499)²/(29 × 0.499) + (14 − 29 × 0.499)²/(29 × 0.499) + (1 − 29 × 0.002)²/(29 × 0.002) = 15.33

as the goodness of fit for A and

χ²_B = (14 − 29/3)²/(29/3) + (14 − 29/3)²/(29/3) + (1 − 29/3)²/(29/3) = 11.66
as the goodness of fit for B. This runs against intuition, as A's model clearly seems closer to the observed data than B's. The reason for this strangeness is that the underlying normal approximation is not working.

The right approach is simply to calculate probabilities of the observed data for each of the
two models. Specifically,
P{data | A} = (0.499)^{14} (0.499)^{14} (0.002)^1 = 7 × 10^{−12}

and

P{data | B} = (1/3)^{14} (1/3)^{14} (1/3)^1 = 1.46 × 10^{−14}.
So A’s model assigns much higher probability (about 483 times higher) to the observed data
compared to B’s model. This certainly matches our intuition. Note that if we don’t know the
specific order of the 29 outcomes, we can multiply the above probabilities by the multinomial
coefficient
29!/(14! 14! 1!)
but this factor will not change anything as both the above probabilities will be multiplied by
this same factor.

The moral of this example is to always calculate binomial/multinomial probabilities directly


(or use the Entropy Approximation if an approximation is necessary). The normal approxi-
mation probabilities should be calculated only when one is sure that the normal approximation
is accurate.
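The computations in Example 8.1 are easily reproduced (a pure-Python sketch, not part of the original notes):

obs = [14, 14, 1]
for probs in [[0.499, 0.499, 0.002], [1/3, 1/3, 1/3]]:
    exp_counts = [29 * p for p in probs]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp_counts))
    likelihood = 1.0
    for o, p in zip(obs, probs):
        likelihood *= p ** o
    print(chi2, likelihood)   # A: 15.33 and ~7e-12;  B: 11.66 and ~1.5e-14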

9 Lecture Nine

9.1 Normal Approximation for the Binomial: CLT

In the last class, we studied the normal approximation to the Binomial. We proved that
P{Bin(n, p) = k} ≈ ϕ(k; np, np(1 − p)) = [1/√(2πnp(1 − p))] exp(−(k − np)²/(2np(1 − p)))    (35)

This approximation is not always good. It is only accurate when f := k/n is close to p. More precisely, we argued in the last class that the approximation is good provided

n|f − p|³ / [6(min(f, p))²(min(1 − f, 1 − p))²]    (36)

is small. Note the presence of n in the numerator above. This means that, for the approximation to be accurate, f needs to be much closer to p when n is large.

The normal approximation of the binomial is usually stated in the form of the De Moivre-
Laplace Central Limit Theorem as discussed below.

9.1.1 De Moivre-Laplace Central Limit Theorem

Fact 9.1. Let X ∼ Bin(n, p) with fixed 0 < p < 1. For every fixed pair of real numbers a
and b with a < b, we have
lim_{n→∞} P{a ≤ (X − np)/√(np(1 − p)) ≤ b} = ∫_a^b (1/√(2π)) exp(−z²/2) dz.    (37)

Here is a sketch of the proof of (37). First write


P{a ≤ (X − np)/√(np(1 − p)) ≤ b} = P{np + a√(np(1 − p)) ≤ X ≤ np + b√(np(1 − p))}
= Σ_{k ∈ [np + a√(np(1−p)), np + b√(np(1−p))]} P{X = k}.

The key now is to observe that, in the range k ∈ [np + a√(np(1 − p)), np + b√(np(1 − p))], the normal approximation is accurate. To see this, first note that

f = k/n ∈ [p + a√(p(1 − p)/n), p + b√(p(1 − p)/n)].

As a result, |f − p| is at most of order n^{−1/2}, which implies that n|f − p|³ will be of order n^{−1/2} and thus small (the denominator in (36) will behave like a constant as p is fixed away from 0 and 1 and f is close to p). As a result, we can use the approximation (35) to write

P{a ≤ (X − np)/√(np(1 − p)) ≤ b} ≈ Σ_{k ∈ [np + a√(np(1−p)), np + b√(np(1−p))]} ϕ(k; np, np(1 − p)).

Using the formula

ϕ(x; µ, σ²) = (1/σ) ϕ((x − µ)/σ),

we get

ϕ(k; np, np(1 − p)) = [1/√(np(1 − p))] ϕ((k − np)/√(np(1 − p)))

so that

P{a ≤ (X − np)/√(np(1 − p)) ≤ b} ≈ Σ_{k ∈ [np + a√(np(1−p)), np + b√(np(1−p))]} [1/√(np(1 − p))] ϕ((k − np)/√(np(1 − p))).

Let us now write

z_k = (k − np)/√(np(1 − p))    so that    z_k − z_{k−1} = 1/√(np(1 − p)).

Thus

P{a ≤ (X − np)/√(np(1 − p)) ≤ b} ≈ Σ_{z_k ∈ [a,b]} (z_k − z_{k−1}) ϕ(z_k).

The right hand side above is a Riemann sum for ∫_a^b ϕ(z) dz and it approaches that integral as n → ∞ (note that z_k − z_{k−1} → 0 as n → ∞). This proves (37).

The statement (37) is also true if a = −∞ and/or b = +∞. This can be derived as a
consequence of (37) for finite a and b. This argument is technical and omitted; if interested,
see Corollary 3.2 of the book Probability Theory by Yakov G. Sinai.
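A numerical illustration of (37) (a sketch assuming scipy; p, a, b below are arbitrary choices): the standardized binomial probability approaches the normal integral as n grows.

import math
from scipy.stats import binom, norm

p, a, b = 0.3, -1.0, 2.0
for n in [50, 500, 5000]:
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    # P{mu + a*sd <= X <= mu + b*sd} for X ~ Bin(n, p)
    lhs = binom.cdf(math.floor(mu + b * sd), n, p) - binom.cdf(math.ceil(mu + a * sd) - 1, n, p)
    print(n, lhs, norm.cdf(b) - norm.cdf(a))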

9.2 The Exponential Distribution

The exponential distribution is given by the exponential density. The exponential density
with rate parameter λ > 0 (denoted by Exp(λ)) is given by

f(x) = λe^{−λx} I{x > 0}.

It is arguably the simplest density for modeling random quantities that are constrained to
be nonnegative. It is used to model things such as the time of the first phone call that
a telephone operator receives starting from now. This can be justified by a discretization
argument as follows.

Suppose we divide the time starting now into a large number of small intervals each of
length δ. In each time interval, assume that there can be at most one phone call and that the
probability of a phone call is a small real number p. Also assume independence of getting
phone calls in distinct time intervals. In this setup, suppose X is the random variable
denoting the time of the first phone call. For a positive real number x,

P{x ≤ X < x + δ}

is the probability that there is no phone call in the first x/δ time intervals and that there is
a phone call in the (1 + (x/δ))th interval. Thus
P{x ≤ X < x + δ} ≈ (1 − p)^{x/δ} p.

The quantity p is quite small so we can use the approximation 1 − p ≈ e^{−p}. This gives

P{x ≤ X < x + δ} ≈ p exp(−(p/δ)x) = δ (p/δ) exp(−(p/δ)x).
Now the quantity p/δ is the average number of phone calls in unit time (note that there are 1/δ intervals in unit time). This can therefore be termed the “rate” of arrival of phone calls:

λ = p/δ.
We can then write

P{x ≤ X < x + δ} ≈ δλ exp(−λx)

which is the same as

f_X(x) = lim_{δ↓0} (1/δ) P{x ≤ X < x + δ} = λ exp(−λx)

for x > 0.

Observe also that in the above setup, the distribution of the number of phone calls in any
time interval of length T is Bin(T /δ, p) (as the number of small time intervals each of length
δ in the original time interval of length T equals T /δ). The quantity
(T/δ) p² = Tδ (p/δ)² = Tδλ²

is small if δ is small and T and λ are held fixed. The Poisson approximation therefore holds and we can approximate the distribution of the number of phone calls in any time interval of length T as Poi((T/δ)p) = Poi(λT).

Also, by the assumption of independence, the number of phone calls in disjoint time
intervals will be independent.

These two assumptions (Poisson distribution of arrivals in any time interval and inde-
pendence of number of arrivals in disjoint time intervals) are characteristic of the Poisson
process. We have therefore assumed that the phone call arrivals form a Poisson process. The
waiting time for the first phone call then is Exponentially Distributed.

The exponential density has the memorylessness property (just like the Geometric distri-
bution in the discrete case). To see this, first note that
P{X > x} = ∫_x^∞ λe^{−λu} du = e^{−λx},

which gives

P{X > a + b | X > b} = P{X > a + b}/P{X > b} = e^{−λ(a+b)}/e^{−λb} = e^{−λa} = P{X > a}.
The property

P{X > a + b | X > b} = P{X > a} for all a > 0, b > 0 (38)

is called memorylessness.

The exponential density is the only density on (0, ∞) that has the memorylessness property
(proof left as exercise). In this sense, the Exponential distribution can be treated as the
continuous analogue of the Geometric distribution. Note that a Geometric random variable
would satisfy (38) when a, b are integers but not when a, b are arbitrary real numbers.

9.3 The Gamma Distribution

It is customary to talk about the Gamma density after the exponential density. The Gamma
density with shape parameter α > 0 and rate parameter λ > 0 is given by

f(x) ∝ x^{α−1} e^{−λx} I{x > 0}.    (39)

To find the proportionality constant above, we need to evaluate


∫_0^∞ x^{α−1} e^{−λx} dx = (1/λ^α) ∫_0^∞ u^{α−1} e^{−u} du.

Now the function

Γ(α) := ∫_0^∞ u^{α−1} e^{−u} du    for α > 0

is called the Gamma function in mathematics. So the constant of proportionality in (39) is given by

λ^α / Γ(α)

so that the Gamma density has the formula:

f(x) = [λ^α / Γ(α)] x^{α−1} e^{−λx} I{x > 0}.
We shall refer to this as the Gamma(α, λ) density.

Note that the Gamma(α, λ) density reduces to the Exp(λ) density when α = 1. Therefore,
Gamma densities can be treated as a generalization of the Exponential density. In fact, the
Gamma density can be seen as the continuous analogue of the negative binomial distribution
because if X1 , . . . , Xk are independent Exp(λ) random variables, then X1 + · · · + Xn ∼
Gamma(k, λ) (thus the Gamma distribution arises as the sum of i.i.d exponentials just as
the Negative Binomial distribution arises as the sum of i.i.d Geometric random variables).

Here are some elementary properties of the Gamma function that will be useful to us later.
The Gamma function does not have a closed form expression for arbitrary α > 0. However
when α is a positive integer k, it can be shown that

Γ(k) = (k − 1)! for k ≥ 1. (40)

The above identity is a consequence of the property

Γ(α + 1) = αΓ(α) for α > 0 (41)

and the trivial fact that Γ(1) = 1. You can easily verify (41) by integration by parts.

Another easy fact about the Gamma function is that Γ(1/2) = √π (this is a consequence
of the fact that ∫ e^{−x²/2} dx = √(2π)).
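Both facts are easy to check numerically with R's built-in gamma function:

    gamma(5)       # equals 4! = 24, consistent with (40)
    gamma(0.5)^2   # equals pi, consistent with Gamma(1/2) = sqrt(pi)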

When α = k is a positive integer, the Γ(k, λ) distribution arises as the distribution of the
k-th arrival in a Poisson process of rate λ. To see this, consider the same binomial setup (that we
used for modeling the arrivals of phone calls in the previous section). If X denotes
the waiting time for the k-th phone call, then (for x > 0)

P{x ≤ X < x + δ}

is the probability that there are exactly k − 1 phone calls in the first x/δ small time intervals
(of length δ) and an additional phone call in the (x/δ + 1)th interval. Thus
 
    P{x ≤ X < x + δ} ≈ (x/δ choose k−1) p^k (1 − p)^{(x/δ)−k+1}.
If x and k are fixed, then x/δ is much larger than k − 1 so we can approximate the right
hand side above as (also as p is small)

    P{x ≤ X < x + δ} ≈ [(x/δ)^{k−1}/(k−1)!] p^k (1 − p)^{x/δ}.

Now using 1 − p ≈ e−p and writing λ = p/δ,

    P{x ≤ X < x + δ} ≈ δ [λ^k/(k−1)!] x^{k−1} e^{−λx}
which means that X has the Gamma(k, λ) density.

9.4 Variable Transformations

It is often common to take functions or transformations of random variables. Consider a
random variable X and apply a function u(·) to X to transform X into another random
variable Y = u(X). How does one find the distribution of Y = u(X) from the distribution
of X?

If X is a discrete random variable, then Y = u(X) will also be discrete and then the pmf
of Y can be written directly in terms of the pmf of X:
    P{Y = y} = P{u(X) = y} = Σ_{x : u(x) = y} P{X = x}.

If X is a continuous random variable with density fX and T (·) is a smooth function, then
it is fairly straightforward to write down the density of Y = T (X) in terms of fX . In the
case when T is invertible and T^{−1} is differentiable, there is the following formula:

    f_{T(X)}(y) = f_X(T^{−1}(y)) |dT^{−1}(y)/dy|.
We shall look at the ideas behind this formula (as well as how to solve this problem when T is
not invertible) in the next class, where we will start with the following problem.

Example 9.2. Suppose X ∼ U (−π/2, π/2). What is the density of Y = tan(X)? We shall
prove, in the next class, that Y has the Cauchy density:
    f_Y(y) = 1/(π(1 + y²))   for −∞ < y < ∞.

10 Lecture Ten

10.1 Variable Transformations

It is often common to take functions or transformations of random variables. Consider a
random variable X and apply a function u(·) to X to transform X into another random
variable Y = u(X). How does one find the distribution of Y = u(X) from the distribution
of X?

If X is a discrete random variable, then Y = u(X) will also be discrete and then the pmf
of Y can be written directly in terms of the pmf of X:
    P{Y = y} = P{u(X) = y} = Σ_{x : u(x) = y} P{X = x}.

If X is a continuous random variable with density fX and T (·) is a smooth function, then
it is fairly straightforward to write down the density of Y = T (X) in terms of fX . There
are some general formulae for doing this but it is better to learn how to do it from first
principles. The general idea will be clear from the following two examples.

Example 10.1. Suppose X ∼ U (−π/2, π/2). What is the density of Y = tan(X)? Here is
the method for doing this from first principles. Note that the range of tan(x) as x ranges
over (−π/2, π/2) is R so fix y ∈ R and we shall find below the density g of Y at y.

The formula for g(y) is
    g(y) = lim_{δ↓0} (1/δ) P{y < Y < y + δ}

so that
P{y < Y < y + δ} ≈ g(y)δ when δ is small. (42)
Now for small δ,

    P{y < Y < y + δ} = P{y < tan(X) < y + δ}
                     = P{arctan(y) < X < arctan(y + δ)}
                     ≈ P{arctan(y) < X < arctan(y) + δ arctan′(y)}
                     = P{arctan(y) < X < arctan(y) + δ/(1 + y²)}
                     ≈ f(arctan(y)) δ/(1 + y²)
where f is the density of X. Comparing the above with (42), we can conclude that
    g(y) = f(arctan(y)) · 1/(1 + y²).
Using now the density of X ∼ U (−π/2, π/2), we deduce that
    g(y) = 1/(π(1 + y²))   for y ∈ R.
This is the Cauchy density.

The answer derived in the above example is a special case of the following formula:
    f_{T(X)}(y) = f_X(T^{−1}(y)) |dT^{−1}(y)/dy|.    (43)
which makes sense as long as T is invertible and T −1 is differentiable. If the function T is not
invertible, then the formula above cannot be directly used but the method (based on first
principles) used to derive the above formula is applicable in all cases. Here is an example
illustrating this.

Example 10.2. Suppose X has the standard normal density. What is the density of Y = X²?

The function T (x) = x2 is not invertible so the formula (43) cannot be used directly.
Instead we argue from first principles as follows. Let y > 0. The density of Y at y is given
by
    f_Y(y) = lim_{δ↓0} (1/δ) P{y < Y < y + δ}.

For small δ > 0, we can write


    P{y < Y < y + δ} = P{√y < X < √(y + δ)} + P{−√(y + δ) < X < −√y}
                     ≈ P{√y < X < √y + δ (d√y/dy)} + P{−√y − δ (d√y/dy) < X < −√y}
                     = P{√y < X < √y + δ/(2√y)} + P{−√y − δ/(2√y) < X < −√y}
                     ≈ ϕ(√y) δ/(2√y) + ϕ(−√y) δ/(2√y) = [ϕ(√y)/√y] δ.

Thus

    f_Y(y) = ϕ(√y)/√y = (1/√(2π)) y^{−1/2} e^{−y/2}   for y > 0.
This is the density of the chi-squared random variable with 1 degree of freedom. This is also
the Gamma random variable with shape parameter α = 0.5 and rate parameter λ = 0.5.

Note that we can also try to calculate the density of Y at 0 by the above method:
    P{0 < Y < δ} = P{−√δ < X < √δ} ≈ 2ϕ(0) √δ

so that

    f_Y(0) = lim_{δ↓0} (1/δ) P{0 < Y < δ} = 2ϕ(0) lim_{δ↓0} √δ/δ = 2ϕ(0) lim_{δ↓0} δ^{−1/2} = ∞.

This ∞ does not affect any calculation of probabilities P{Y ∈ A} as these probabilities are
calculated by the integral ∫_A f_Y(y) dy and the value of f_Y at the one point 0 does not affect
the value of this integral.
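A quick simulation sketch of Example 10.2 (not part of the derivation above; sample size is arbitrary): square standard normal draws and overlay the Gamma(0.5, 0.5) density in R.

    set.seed(1)
    y <- rnorm(1e5)^2
    hist(y, breaks = 200, probability = TRUE, xlim = c(0, 5))
    curve(dgamma(x, shape = 0.5, rate = 0.5), add = TRUE, col = "red")
    # equivalently dchisq(x, df = 1); the spike near 0 reflects f_Y(0) = infinity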

10.2 The Cumulative Distribution Function and the Quantile Transform

The cumulative distribution function (cdf) of a random variable X is the function defined as

F (x) := P{X ≤ x} for −∞ < x < ∞.

This is defined for all random variables discrete or continuous. The cdf of every random
variable has the following properties: (a) It is non-decreasing, (b) right-continuous, (c)
limx↓−∞ F (x) = 0 and limx↑+∞ F (x) = 1.

If the random variable X has a density fX , then its cdf is given by


    F(x) = ∫_{−∞}^x f_X(t) dt

and, in this case, it is generally true that F ′ (x) = fX (x).

The inverse of the cdf is used to define quantiles. Given a random variable X and a
number u ∈ (0, 1), the u-quantile of the distribution of X is given by a real number qX (u)
satisfying
P {X ≤ qX (u)} = u (44)
provided such a number qX (u) exists uniquely. If FX is the cdf of X, the equation (44)
simply becomes
FX (qX (u)) = u
so we can write
qX (u) = FX−1 (u).
Here are some simple examples.

Example 10.3 (Uniform). Suppose X has the uniform distribution on (0, 1). Then FX (x) =
x for x ∈ (0, 1) and thus FX−1 (u) exists uniquely for every u ∈ (0, 1) and equals u. We thus
have qX (u) = u for every u ∈ (0, 1).

Example 10.4 (Normal). Suppose X has the standard normal distribution. Then FX (x) =
Φ(x) where

    Φ(x) = ∫_{−∞}^x ϕ(t) dt = ∫_{−∞}^x (2π)^{−1/2} exp(−t²/2) dt.
There is no closed form expression for Φ but its values can be obtained in R (for example)
using the function pnorm. Φ is a strictly increasing function from (−∞, ∞) to (0, 1) so its
inverse exists uniquely and we thus have

qX (u) = Φ−1 (u) for every u ∈ (0, 1).

There is no closed form expression for qX = Φ−1 but its values can be obtained from R by
the function qnorm.

Example 10.5 (Cauchy). Suppose X has the standard Cauchy density:


    f_X(x) := (1/π) · 1/(1 + x²)   for −∞ < x < ∞.
Its cdf is given by
    F_X(x) = ∫_{−∞}^x (1/π) dt/(1 + t²) = (1/π) arctan(t) |_{−∞}^x = (1/π) arctan(x) + 1/2.

It is easy to see that this is a strictly increasing function from (−∞, ∞) to (0, 1) and its inverse
is given by
FX−1 (u) = tan (π (u − 0.5)) .
Thus the quantile function for the Cauchy distribution is given by

qX (u) = tan (π (u − 0.5)) for every u ∈ (0, 1).

How to define the u-quantile when there is no solution or multiple solutions to the equation
FX (q) = u? No solutions for FX (q) = u can happen for discrete distributions (for example,
for X ∼ Ber(0.5) and u = 0.25, there is no q satisfying P{X ≤ q} = u). Multiple solutions
can also happen. For example, if X is uniformly distributed on the set [0, 1] ∪ [2, 3] and
u = 0.5, then every q ∈ [1, 2] satisfies FX (q) = 0.5. In such cases, it is customary to define
the u-quantile via
qX (u) := inf{x ∈ R : FX (x) ≥ u}. (45)
This can be seen as a generalization of FX−1 (u). Indeed, if there is a unique q such that
FX (q) = u, it is easy to see then that qX (u) = q.

The function qX : (0, 1) → (−∞, ∞) defined by (45) is called the quantile function or
the quantile transform of the random variable X. It can be checked that the definition (45)
ensures that
P{X ≤ qX (u)} ≥ u and P{X < qX (u)} ≤ u. (46)

Example 10.6 (Bernoulli). Suppose X ∼ Ber(p) i.e., P{X = 0} = 1−p and P{X = 1} = p.
What then is q_X(u) for u ∈ (0, 1)? It can be checked that

    q_X(u) = 0 for 0 < u ≤ 1 − p,   and   q_X(u) = 1 for 1 − p < u < 1.

The quantile transform is important for the following reason.

54
Proposition 10.7. The following two statements are true.

1. Suppose U is a random variable distributed according to the uniform distribution on
   (0, 1). Then q_X(U) has the same distribution as X. In other words, the function q_X
   transforms the uniform distribution to the distribution of X.

2. Suppose X is a random variable with a continuous cdf FX . Then FX (X) has the uni-
form distribution on (0, 1). In other words, the function FX transforms the distribution
of X into the U nif (0, 1) distribution (provided the distribution of X is continuous).

It should be stressed that the first conclusion of the above Proposition holds for every X
(discrete or continuous) while the second conclusion is only true if FX is continuous. To see
why the second conclusion is false when FX is not continuous, suppose that X ∼ Ber(p)
so that X takes only the two values 0 and 1. Then FX (X) also takes only two values:
FX (0) = 1 − p and FX (1) = 1; thus FX (X) cannot have the uniform distribution on (0, 1).

Example 10.8 (Cauchy). We have seen in Example 10.5 that for a standard Cauchy random
variable, qX (u) = tan(π(u − 0.5)). Proposition 10.7 then gives that if U ∼ U nif (0, 1), then

tan(π(U − 0.5)) ∼ Cauchy.

Note that π(U − 0.5) ∼ U nif (−π/2, π/2). Thus the tan function applied to a uniformly
distributed random variable on (−π/2, π/2) results in a random variable having the Cauchy
distribution (as we have seen in Example 10.1).

Example 10.9 (Bernoulli). The quantile function for Ber(p) was calculated in Example 10.6 as

q(u) = I{1 − p < u < 1}.

As a result, the first conclusion of Proposition 10.7 states that, for U ∼ U nif (0, 1),

q(U ) = I{1 − p < U < 1} ∼ Ber(p)

as can be checked directly. Note that this function q is not the only function with the property
that q(U ) ∼ Ber(p). For example, the function q̃(u) = I{0 < u < p} also satisfies q̃(U ) ∼
Ber(p).
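The first part of Proposition 10.7 is the basis of inverse transform sampling. Here is a small R sketch (p = 0.3 is an illustrative choice) combining Examples 10.8 and 10.9:

    set.seed(1)
    u <- runif(1e5)
    x_cauchy <- tan(pi * (u - 0.5))    # Example 10.8: Cauchy draws from uniforms
    x_bern <- as.numeric(u > 1 - 0.3)  # Example 10.9 with p = 0.3
    mean(x_bern)                       # should be close to 0.3
    median(x_cauchy)                   # should be close to 0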

11 Lecture Eleven

11.1 Joint Densities

Joint densities are used to describe the distribution of a finite set of continuous random
variables. We focus on bivariate joint densities (i.e., when there are two continuous variables
X and Y ). The ideas are the same for the case of more than two variables.

A real-valued function of two variables f(·, ·) is called a joint density if

    f(x, y) ≥ 0 for all x, y   and   ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = 1.

We say that two random variables X and Y have joint density f (·, ·) if
    P{(X, Y) ∈ B} = ∫∫_B f(x, y) dx dy = ∫∫ I{(x, y) ∈ B} f(x, y) dx dy

for every subset B of R2 . We shall often denote the joint density of (X, Y ) by fX,Y .

Here are two simple examples of joint densities.

Example 11.1. Consider the function


    f(x, y) = (1/2π) exp(−(x² + y²)/2).
First note that this is a valid joint density as this function is always nonnegative and
    ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = ∫_{−∞}^∞ ∫_{−∞}^∞ (1/2π) exp(−(x² + y²)/2) dx dy
        = [∫_{−∞}^∞ (1/√(2π)) exp(−x²/2) dx] [∫_{−∞}^∞ (1/√(2π)) exp(−y²/2) dy] = 1.

Suppose a pair of random variables (X, Y ) have this as their joint density. Then probabilities
involving X, Y are calculated by integrating the density f (x, y) in the appropriate region. For
example,
    P{−1 ≤ X + Y ≤ 2} = ∫∫ I{(x, y) : −1 ≤ x + y ≤ 2} f(x, y) dx dy
                      = ∫∫ I{(x, y) : −1 ≤ x + y ≤ 2} (1/2π) exp(−(x² + y²)/2) dx dy.

The set {(x, y) : −1 ≤ x + y ≤ 2} represents the region between the two lines x + y = 2 and
x + y = −1.

Example 11.2. Consider the function



    f(x, y) = { 1 : 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1
              { 0 : otherwise

This function takes the value 1 on the set {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1} and can also be
written succinctly as
f (x, y) = I{0 ≤ x, y ≤ 1}.
This is clearly a density function as this is nonnegative and integrates to one (the area of
the unit square {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1} equals 1). Suppose the random variables X
and Y have this joint density f , then
    P{(X, Y) ∈ B} = ∫∫ I{(x, y) ∈ B} f(x, y) dx dy
                  = ∫∫ I{(x, y) ∈ B} I{0 ≤ x, y ≤ 1} dx dy

= area of B ∩ {(x, y) : 0 ≤ x, y ≤ 1}

For example,
    P{X² + Y² ≤ 1} = area of {(x, y) : 0 ≤ x, y ≤ 1 and x² + y² ≤ 1} = π/4.
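A Monte Carlo version of this calculation (a sketch; the sample size is an arbitrary choice) simply draws uniform points in the unit square:

    set.seed(1)
    x <- runif(1e6); y <- runif(1e6)
    mean(x^2 + y^2 <= 1)   # should be close to pi/4 = 0.7853...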

In order to calculate joint densities, the following formula is very useful. If ∆ is a small
region in R2 around a point (x, y), we have

P{(X, Y ) ∈ ∆} ≈ (area of ∆) fX,Y (x, y). (47)

More formally,
    lim_{∆↓(x,y)} P{(X, Y) ∈ ∆} / (area of ∆) = f_{X,Y}(x, y)
where the limit is taken as ∆ shrinks to (x, y). Here are two special cases of this formula:

1. By taking ∆ to be the rectangle {(a, b) : x ≤ a ≤ x + δ, y ≤ b ≤ y + ϵ} for small δ and
   ϵ, we get

       f_{X,Y}(x, y) = lim_{δ↓0, ϵ↓0} P{x ≤ X ≤ x + δ, y ≤ Y ≤ y + ϵ} / (δϵ).    (48)

2. By taking ∆ to be the circle centered at (x, y) of radius r, we get

       f_{X,Y}(x, y) = lim_{r↓0} P{(X − x)² + (Y − y)² ≤ r²} / (πr²).

The usefulness of the formula (47) is illustrated in the following two examples.

Example 11.3. Suppose (X, Y ) have the joint density:


    f(x, y) = (1/2π) exp(−(x² + y²)/2).

Define two new random variables R and Θ as follows: R := √(X² + Y²) and Θ is the angle
made by the vector (X, Y ) with the positive X-axis (measured from the positive X-axis in the
counterclockwise direction) (note that Θ takes values between 0 and 2π). What is the joint
density of (R, Θ)?

Let us calculate the joint density fR,Θ (r, θ) of (R, Θ) at a fixed point (r, θ). Let us assume
that r > 0 and 0 < θ < 2π. One way of calculating this is via (48):

    f_{R,Θ}(r, θ) = lim_{δ,ϵ↓0} P{r < R < r + δ, θ < Θ < θ + ϵ} / (δϵ)    (49)

We can calculate P{r < R < r + δ, θ < Θ < θ + ϵ} in the following way:

P{r < R < r + δ, θ < Θ < θ + ϵ} = P{(X, Y ) ∈ S}


where S is the set of all points (x, y) such that r < √(x² + y²) < r + δ and the angle made
by (x, y) with the positive x-axis lies between θ and θ + ϵ. As can be seen from Figure 2,
when δ, ϵ are small, the set S is a small region around the point (r cos θ, r sin θ). Moreover,
its area is approximately equal to rϵδ. We thus get

P{(X, Y ) ∈ S} ≈ fX,Y (r cos θ, r sin θ) × area of S


    = (1/2π) exp(−((r cos θ)² + (r sin θ)²)/2) rδϵ = r exp(−r²/2) (1/2π) δϵ.

Combining the above with (49), we deduce that


    f_{R,Θ}(r, θ) = r exp(−r²/2) (1/2π)   whenever r > 0, 0 < θ < 2π.

Example 11.4. Suppose X, Y have joint density fX,Y . What is the joint density of U and
V where U = X and V = X + Y ?

Figure 2: The set S

We see that (U, V ) = T (X, Y ) where T (x, y) = (x, x + y). This transformation T is clearly
invertible and its inverse is given by S(u, v) = T −1 (u, v) = (u, v − u). In order to determine
the joint density of (U, V ) at a point (u, v), let us consider

P{u ≤ U ≤ u + δ, v ≤ V ≤ v + ϵ} ≈ δϵfU,V (u, v). (50)

Let R denote the rectangle joining the points (u, v), (u + δ, v), (u, v + ϵ) and (u + δ, v + ϵ).
Then the above probability is the same as

P{(U, V ) ∈ R} = P{(X, Y ) ∈ S(R)}

where S(R) is the image of the rectangle R under the mapping S. How does S(R) look like?
It is the parallelogram joining the points (u, v − u), (u + δ, v − u − δ), (u, v − u + ϵ) and
(u + δ, v − u + ϵ − δ). When δ and ϵ are small, S(R) is clearly a small region around (u, v − u)
which allows us to write

P{(U, V ) ∈ R} = P{(X, Y ) ∈ S(R)} ≈ fX,Y (u, v − u) (area of S(R)) .

The area of the parallelogram S(R) can be computed to be δϵ (using the formula that the area
of a parallelogram equals base times height) so that

P{(U, V ) ∈ R} ≈ fX,Y (u, v − u)δϵ.

Comparing with (50), we obtain

fU,V (u, v) = fX,Y (u, v − u).

This gives the formula for the joint density of (U, V ) in terms of the joint density of (X, Y ).

The logic behind the above two examples can be extended to obtain formulae for the joint
density of an arbitrary transformation of a pair of random variables with known joint density.
We shall first consider linear transformations (as in Example 11.4) and, in the next class,
consider nonlinear transformations.

11.2 Joint Densities under General Linear Invertible transformations

Let us first recall some basic properties of linear transformations.

11.2.1 Linear Transformations

By a linear transformation L : R² → R², we mean a function that is given by

    L(x, y) := M (x, y)ᵀ + c    (51)

where M is a 2 × 2 matrix and c is a 2 × 1 vector. The first term on the right hand side
above involves multiplication of the 2 × 2 matrix M with the 2 × 1 vector with components
x and y.

We shall refer to the 2 × 2 matrix M as the matrix corresponding to the linear transfor-
mation L and often write ML for the matrix M .

The linear transformation L in (51) is invertible if and only if the matrix M is invert-
ible. We shall only be dealing with invertible linear transformations. The following are two
standard properties of linear transformations that you need to be familiar with.

1. If P is a parallelogram in R², then L(P) is also a parallelogram in R². In other words,
   linear transformations map parallelograms to parallelograms.

2. For every parallelogram P , the following identity holds:

   (area of L(P)) / (area of P) = |det(M_L)|.
In other words, the ratio of the areas of L(P ) to that of P is given by the absolute
value of the determinant of the matrix ML .

11.2.2 Invertible Linear Transformations

Suppose X, Y have joint density fX,Y and let (U, V ) = T (X, Y ) for a linear and invertible
transformation T : R2 → R2 . Let the inverse transformation of T be denoted by S. In the
previous example, we had T (x, y) = (x, x + y) and S(u, v) = (u, v − u). The fact that T is
assumed to be linear and invertible means that S is also linear and invertible.

To compute fU,V at a point (u, v), we consider

P{u ≤ U ≤ u + δ, v ≤ V ≤ v + ϵ} ≈ fU,V (u, v)δϵ

for small δ and ϵ. Let R denote the rectangle joining the points (u, v), (u + δ, v), (u, v + ϵ)
and (u + δ, v + ϵ). Then the above probability is the same as

P{(U, V ) ∈ R} = P{(X, Y ) ∈ S(R)}.

What is the region S(R)? Clearly now S(R) is a small region (as δ and ϵ are small) around
the point S(u, v) so that

P{(U, V ) ∈ R} = P{(X, Y ) ∈ S(R)} ≈ fX,Y (S(u, v)) (area of S(R)) .

By the facts mentioned in the previous subsection, we now note that S(R) is a parallelogram
whose area equals |det(MS )| multiplied by the area of R (note that the area of R equals δϵ).
We thus have

    f_{U,V}(u, v) δϵ ≈ P{(U, V) ∈ R} = P{(X, Y) ∈ S(R)} ≈ f_{X,Y}(S(u, v)) |det(M_S)| δϵ

which allows us to deduce that

fU,V (u, v) = fX,Y (S(u, v)) |det MS | . (52)

Remember again that MS is the 2 × 2 matrix corresponding to the linear transformation S.

In the next class, we shall see extensions of the formula (52) for nonlinear transformations.

12 Lecture Twelve

12.1 Last Class: Joint Density

In the last lecture, we started discussing joint densities. Any function f (x, y) of two real
variables x and y is a joint density provided it is nonnegative and integrates (over both x
and y) to 1. If two random variables X and Y have joint density fX,Y , then
    P{(X, Y) ∈ B} = ∫∫ I{(x, y) ∈ B} f_{X,Y}(x, y) dx dy

for every set B ⊆ R2 . One further has


    f_{X,Y}(x, y) = lim_{∆↓(x,y)} P{(X, Y) ∈ ∆} / (area of ∆)
which implies that
P{(X, Y ) ∈ ∆} ≈ fX,Y (x, y) × area of ∆ (53)
provided ∆ is a small region around the point (x, y). The precise shape of the small region
∆ is immaterial for (53).

12.2 Marginal Densities corresponding to a Joint Density

Suppose X and Y have joint density fX,Y . Then probabilities involving only the random
variable X can be calculated as:

    P{X ∈ A} = P{X ∈ A, −∞ < Y < ∞}
             = ∫∫ f_{X,Y}(x, y) I{x ∈ A, −∞ < y < ∞} dx dy
             = ∫ I{x ∈ A} [∫_{−∞}^∞ f_{X,Y}(x, y) dy] dx
             = ∫_A [∫_{−∞}^∞ f_{X,Y}(x, y) dy] dx.

Comparing this with the formula for the density fX (x) of a single random variable X:
    P{X ∈ A} = ∫_A f_X(x) dx,

we immediately deduce that fX (x) can be written in terms of fX,Y (x, y) as:
    f_X(x) = ∫ f_{X,Y}(x, y) dy.

Analogously, the density f_Y(y) of Y is given by:

    f_Y(y) = ∫ f_{X,Y}(x, y) dx.

In words, the density of a single random variable can be derived by integrating the joint
density (of this random variable and another random variable) with respect to the other
variable.

When discussing a joint density fX,Y , individual densities fX of X and fY of Y are referred
to as marginal densities.

12.3 Independence in terms of Joint Densities

Independence of two random variables X and Y can be characterized in terms of their
joint density f_{X,Y} using any of the following statements. The following statements are all
equivalent to each other:

1. The random variables X and Y are independent.

2. The joint density fX,Y (x, y) factorizes into the product of a function depending on x
alone and a function depending on y alone.

3. fX,Y (x, y) = fX (x)fY (y) for all x, y.

Example 12.1. The joint density



    f(x, y) = { 1 : 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1
              { 0 : otherwise

factorizes into the product of a function depending on x alone and a function depending on
y alone because

f (x, y) = I{0 ≤ x ≤ 1, 0 ≤ y ≤ 1} = I{0 ≤ x ≤ 1}I{0 ≤ y ≤ 1}

The factorization above immediately says that if f = fX,Y , then X and Y are independent.
The marginal densities of X and Y are uniform densities on [0, 1].

Example 12.2. Suppose X, Y have the joint density


    f_{X,Y}(x, y) = (1/π) I{x² + y² ≤ 1}.
Show that the marginal density of X is given by
    f_X(x) = (2/π) √(1 − x²) I{−1 ≤ x ≤ 1}.
Are X and Y independent? (Ans: No. Why?)

12.4 How linear transformations change joint densities

In the last lecture, we looked at the following fact. Suppose X, Y have joint density fX,Y
and let (U, V ) = T (X, Y ) for a linear and invertible transformation T : R2 → R2 . Let the
inverse transformation of T be denoted by S. Then the joint density of U and V is given by

fU,V (u, v) = fX,Y (S(u, v)) |det MS | . (54)

where MS is the 2 × 2 matrix corresponding to the linear transformation S.

As an example suppose U = X and V = X + Y so that T(x, y) = (x, x + y) and S(u, v) =
(u, v − u). The matrix corresponding to S is

    M_S = [ 1  0 ]
          [−1  1 ].

The determinant of MS is clearly 1. The formula (54) then gives

fU,V (u, v) = fX,Y (u, v − u).

We shall next study the problem of obtaining the joint densities under differentiable and
invertible transformations that are not necessarily linear.

12.5 General Invertible Transformations

Let (X, Y ) have joint density fX,Y . We transform (X, Y ) to two new random variables
(U, V ) via (U, V ) = T (X, Y ). What is the joint density fU,V ? Suppose that T is invertible
(having an inverse S = T −1 ) and differentiable. Note that S and T are not necessarily linear
transformations.

In order to compute fU,V at a point (u, v), we consider

P{u ≤ U ≤ u + δ, v ≤ V ≤ v + ϵ} ≈ fU,V (u, v)δϵ

for small δ and ϵ. Let R denote the rectangle joining the points (u, v), (u + δ, v), (u, v + ϵ)
and (u + δ, v + ϵ). Then the above probability is the same as

P{(U, V ) ∈ R} = P{(X, Y ) ∈ S(R)}.

What is the region S(R)? If S is linear then S(R) (as we have seen in the last class) will be
a parallelogram. For general S, the main idea is that, as long as δ and ϵ are small, the region
S(R) can be approximated by a parallelogram. This is because S itself can be approximated
by a linear transformation on the region R. To see this, let us write the function S(a, b) as
(S1 (a, b), S2 (a, b)) where S1 and S2 map points in R2 to R. Assuming that S1 and S2 are
differentiable, we can approximate S1 (a, b) for (a, b) near (u, v) by
  
    S1(a, b) ≈ S1(u, v) + (∂S1/∂u (u, v), ∂S1/∂v (u, v)) (a − u, b − v)ᵀ
             = S1(u, v) + (a − u) ∂S1/∂u (u, v) + (b − v) ∂S1/∂v (u, v).
Similarly, we can approximate S2 (a, b) for (a, b) near (u, v) by
  
    S2(a, b) ≈ S2(u, v) + (∂S2/∂u (u, v), ∂S2/∂v (u, v)) (a − u, b − v)ᵀ.

Putting the above two equations together, we obtain that, for (a, b) close to (u, v),
    S(a, b) ≈ S(u, v) + [ ∂S1/∂u (u, v)  ∂S1/∂v (u, v) ] (a − u, b − v)ᵀ.
                        [ ∂S2/∂u (u, v)  ∂S2/∂v (u, v) ]

Therefore S can be approximated by a linear transformation with matrix given by

    J_S(u, v) := [ ∂S1/∂u (u, v)  ∂S1/∂v (u, v) ]
                 [ ∂S2/∂u (u, v)  ∂S2/∂v (u, v) ]

for (a, b) near (u, v). Note that, in particular, when δ and ϵ are small, this linear
approximation for S is valid over the region R. The matrix J_S(u, v) is called the Jacobian
matrix of S(u, v) = (S1(u, v), S2(u, v)) at the point (u, v).

Because of the above linear approximation, we can write

P {(X, Y ) ∈ S(R)} ≈ fX,Y (S(u, v)) |det(JS (u, v))| (area of R)

This gives us the important formula

fU,V (u, v) = fX,Y (S(u, v)) |det JS (u, v)| . (55)

The following is an example of this formula (we derived the result in this example using first
principles in the last class).

Example 12.3. Suppose X and Y are two random variables having joint density f_{X,Y}.
Define two new random variables R and Θ in the following way: R := √(X² + Y²) and
Θ is the angle made by the vector (X, Y) with the positive X-axis in the counterclockwise
direction. What is the joint density of (R, Θ)?

Clearly (R, Θ) = T (X, Y ) where the inverse of T is given by S(r, θ) = (r cos θ, r sin θ). The
density of (R, Θ) at (r, θ) is zero unless r > 0 and 0 < θ < 2π. The formula (55) then gives
 
    f_{R,Θ}(r, θ) = f_{X,Y}(r cos θ, r sin θ) |det [ cos θ  −r sin θ ]| = f_{X,Y}(r cos θ, r sin θ) × r.
                                                   [ sin θ   r cos θ ]
We have thus derived the formula:

fR,Θ (r, θ) = rfX,Y (r cos θ, r sin θ) for every r > 0 and 0 < θ < 2π. (56)

We can also write this formula as:

    f_{X,Y}(x, y) = (1/√(x² + y²)) f_{R,Θ}(√(x² + y²), θ(x, y))   for every −∞ < x, y < ∞,    (57)

where √(x² + y²) represents r and θ(x, y) is the angle made by the vector (x, y) with the positive
X-axis in the counterclockwise direction.

The formulae (56) and (57) have an important connection to the Herschel-Maxwell deriva-
tion of the normal distribution which we discuss next.

12.6 The Herschel-Maxwell Derivation of the Normal Distribution

In the context of Example 12.3, the astronomer John Herschel derived the normal distribution
in the following way (this derivation was extended to the three dimensional case by the
physicist James Clerk Maxwell). See Jaynes [1, Section 7.2] for more details. Their result is
the following.

Fact 12.4. Suppose X and Y are two random variables. Suppose that R and Θ are defined
as in Example 12.3. Assume the following three conditions:

1. X and Y are independent and identically distributed

2. R and Θ are independent

3. Θ is uniformly distributed on (0, 2π).

Then X and Y have the normal distribution N (0, σ 2 ) with mean zero and some variance σ 2 .

Before proving Fact 12.4, let us note that if X and Y are independently distributed as
N (0, σ 2 ), then all the conditions above are true. To see this, observe first that, by inde-
pendence,

    f_{X,Y}(x, y) = f_X(x) f_Y(y) = (1/(√(2π)σ)) exp(−x²/(2σ²)) · (1/(√(2π)σ)) exp(−y²/(2σ²))
                  = (1/(2πσ²)) exp(−(x² + y²)/(2σ²)).

The formula (56) then gives

    f_{R,Θ}(r, θ) = r f_{X,Y}(r cos θ, r sin θ) = r (1/(2πσ²)) exp(−((r cos θ)² + (r sin θ)²)/(2σ²))
                  = r (1/(2πσ²)) exp(−r²/(2σ²)).

The above is the joint density of R and Θ at (r, θ) provided r > 0 and 0 < θ < 2π. To make
the ranges of the variables r and θ clear, we write

    f_{R,Θ}(r, θ) = r (1/(2πσ²)) exp(−r²/(2σ²)) I{r > 0} I{0 < θ < 2π}.

The right hand side above clearly factorizes into the product of a function depending on r
alone and a function depending on θ alone. This implies that R and Θ are independent. The
marginal distribution of Θ is given by integrating over r:
    f_Θ(θ) = I{0 < θ < 2π} ∫_0^∞ r (1/(2πσ²)) exp(−r²/(2σ²)) dr.

The substitution s = r²/(2σ²) (so ds = r dr/σ²) leads to

    f_Θ(θ) = I{0 < θ < 2π} ∫_0^∞ (e^{−s}/(2π)) ds = I{0 < θ < 2π}/(2π).

This means that Θ is uniformly distributed over (0, 2π). We have thus proved that all
the three conditions of Fact 12.4 are satisfied when X, Y are independently distributed as
N (0, σ 2 ).
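A simulation sketch of this verification (σ = 1 is an illustrative choice): draw independent normal pairs in R and inspect R and Θ empirically.

    set.seed(1)
    x <- rnorm(1e5); y <- rnorm(1e5)
    r <- sqrt(x^2 + y^2)
    theta <- atan2(y, x) %% (2 * pi)  # angle in [0, 2*pi)
    hist(theta, probability = TRUE)   # should look flat on (0, 2*pi)
    cor(r, theta)                     # near 0 (a necessary check for independence)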

We shall now prove Fact 12.4 by showing that X, Y being N (0, σ 2 ) is the only way all the
three conditions are satisfied.

Proof of Fact 12.4. We shall work with (57) which connects the joint density of (X, Y ) to
the joint density of (R, Θ). Because X and Y are assumed to be independent and identically
distributed, the left hand side of (57) becomes

fX,Y (x, y) = fX (x)fY (y) = f (x)f (y)

where f is the common density of X and Y. On the other hand, because R and Θ are
independent and Θ ∼ Unif(0, 2π), the right hand side of (57) becomes

    (1/√(x² + y²)) f_{R,Θ}(√(x² + y²), θ(x, y)) = (1/√(x² + y²)) f_R(√(x² + y²)) f_Θ(θ(x, y))
                                                = (1/√(x² + y²)) f_R(√(x² + y²)) (1/(2π)).

We thus obtain

    f(x) f(y) = (1/√(x² + y²)) f_R(√(x² + y²)) (1/(2π))    (58)
for every −∞ < x, y < ∞. Plugging in y = 0 above, we obtain
    (1/|x|) f_R(|x|) (1/(2π)) = f(x) f(0)   for every −∞ < x < ∞.

Plugging in √(x² + y²) in place of x above, we obtain

    (1/√(x² + y²)) f_R(√(x² + y²)) (1/(2π)) = f(√(x² + y²)) f(0).

Combining the above identity with (58), we deduce


    f(x) f(y) = f(√(x² + y²)) f(0)   for every −∞ < x, y < ∞.    (59)

This identity implies that f is a symmetric function (i.e., f (x) = f (−x) = f (|x|)) because if
we take y = 0, we get f (x)f (0) = f (|x|)f (0) or f (x) = f (|x|).

Let h : [0, ∞) → [0, ∞) be defined by

    h(u) = f(√u)   for every u ≥ 0.

Then (59) implies

h(u)h(v) = h(u + v)h(0) for every u, v ≥ 0,

or equivalently
    h(u + v)/h(0) = (h(u)/h(0)) (h(v)/h(0))   for every u, v ≥ 0.
This implies that for every nonnegative integer m and u ≥ 0,
    h(mu)/h(0) = h(u + · · · + u)/h(0) = (h(u)/h(0)) · · · (h(u)/h(0)) = (h(u)/h(0))^m.    (60)
Two consequences of the above are:
    h(mu/n)/h(0) = (h(u/n)/h(0))^m,

which is obtained by replacing u by u/n in (60), and

    h(u/n)/h(0) = (h(u)/h(0))^{1/n},
which is obtained by replacing u by u/n in (60) and taking m = n. Combining the above
two equations, we obtain

    h(mu/n)/h(0) = (h(u/n)/h(0))^m = (h(u)/h(0))^{m/n}.

As m and n are nonnegative integers, we have deduced (take u = 1)
    h(x)/h(0) = (h(1)/h(0))^x
whenever x ≥ 0 is a rational number (i.e., of the form m/n for some integers m and n). If
we now assume that h is continuous, we can deduce the above for every x ≥ 0. We have
thus proved that

    h(x) = h(0) exp(x log(h(1)/h(0))) = c exp(xb)

for some constants c (c = h(0)) and b (b = log(h(1)/h(0))). As f(√u) = h(u) (and f is symmetric),
we get

    f(±√u) = c exp(ub).
We have thus proved that

    f(x) = c exp(x² b)   for every −∞ < x < ∞.

As f needs to be a valid density, we must have b < 0 so we can write b = −1/(2σ²) for some
σ > 0. This will necessarily imply that c = 1/(√(2π)σ) leading to f being the N(0, σ²) density.
This completes the proof of Fact 12.4.

13 Lecture Thirteen

13.1 Joint Density under Transformations

Let (X, Y ) have joint density fX,Y . We transform (X, Y ) to two new random variables
(U, V ) via (U, V ) = T (X, Y ). Suppose that T is invertible (having an inverse S = T −1 ) and
differentiable. In the last class, we saw the following formula relating the joint density of
(U, V ) to fX,Y :
fU,V (u, v) = fX,Y (S(u, v)) |det JS (u, v)| . (61)
Let us start today by working out the following simple application of the formula (61).

Example 13.1. Suppose X and Y are independent random variables with

X ∼ Gamma(α1 , λ) and Y ∼ Gamma(α2 , λ).

Note that the rate parameter is the same in both the Gamma distributions. Now define
    U := X + Y   and   V := X/(X + Y).
What is the joint density of U and V ? Here the transformation T is given by T (x, y) =
(x + y, x/(x + y)) and its inverse transformation can be checked to be S(u, v) = (uv, u(1−v)).
The formula (61) then gives that for every u > 0 and 0 < v < 1 (we are taking u > 0 because
the random variable U is always positive and V is between 0 and 1):

fU,V (u, v) = fX,Y (uv, u(1 − v))u = fX (uv)fY (u(1 − v))u.

Plugging in the relevant Gamma densities for fX and fY , we can deduce that
    f_{U,V}(u, v) = [λ^{α1+α2}/Γ(α1 + α2)] u^{α1+α2−1} e^{−λu} I{u > 0} × [Γ(α1 + α2)/(Γ(α1)Γ(α2))] v^{α1−1} (1 − v)^{α2−1} I{0 < v < 1}.    (62)

This implies that U ∼ Gamma(α1 + α2 , λ). (62) also implies that the density of V is
    f_V(v) = [Γ(α1 + α2)/(Γ(α1)Γ(α2))] v^{α1−1} (1 − v)^{α2−1} I{0 < v < 1}.
The above density is known as the Beta density with parameters α1 and α2 : Beta(α1 , α2 ).
Using the notation
    B(α1, α2) = Γ(α1)Γ(α2)/Γ(α1 + α2) = ∫_0^1 v^{α1−1} (1 − v)^{α2−1} dv,    (63)

the Beta density can also be written as


    f_V(v) = [1/B(α1, α2)] v^{α1−1} (1 − v)^{α2−1} I{0 < v < 1}.
The name “Beta density” is derived from the Beta function which is the name given to the
function (63) in mathematics.
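A simulation sketch of Example 13.1 (the values α1 = 2, α2 = 3, λ = 1.5 below are illustrative): the ratio X/(X + Y) should match the Beta(α1, α2) density.

    set.seed(1)
    a1 <- 2; a2 <- 3; lambda <- 1.5
    x <- rgamma(1e5, shape = a1, rate = lambda)
    y <- rgamma(1e5, shape = a2, rate = lambda)
    v <- x / (x + y)
    hist(v, breaks = 100, probability = TRUE)
    curve(dbeta(x, a1, a2), add = TRUE, col = "red")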

One of the conclusions of the above example is that if X1 ∼ Gamma(α1, λ) and X2 ∼
Gamma(α2, λ) are independent, then

    X1 + X2 ∼ Gamma(α1 + α2, λ).    (64)

More generally, if X1, . . . , Xn are independent random variables with X_i ∼ Gamma(α_i, λ),
then

    X1 + · · · + Xn ∼ Gamma(α1 + · · · + αn, λ).    (65)
(65) can be proved from (64) by, for example, mathematical induction over n. One conse-
quence of (65) is that if Z1 , . . . , Zn are independent random variables having the standard
normal distribution, then
 
    Z1² + · · · + Zn² ∼ Gamma(n/2, 1/2).

This is because, as we saw earlier in Lecture 10 (see Example 10.2), the square
of a normal random variable has the Gamma(1/2, 1/2) distribution. The distribution of the
sum of squares of n independent normal random variables is also known as the chi-squared
distribution with n degrees of freedom: χ2n . We thus have
 
    χ²_n = Gamma(n/2, 1/2).

13.2 Conditional Densities for Continuous Random Variables

Consider two random variables X and Y with joint density fX,Y . How do we calculate the
conditional probability:
P {X ∈ A | Y = y0 } (66)
for some subset A ⊆ R and y0 ∈ R. The naive way to calculate the above probability is to
write it as

    P{X ∈ A | Y = y0} = P{X ∈ A, Y = y0} / P{Y = y0}.

The denominator on the right hand side above equals 0 because Y is a continuous random
variable. As a result, the numerator also equals zero. Thus the right hand side equals 0/0
and is hence undefined.

The proper way to define (66) is to think of the conditioning event Y = y0 as y0 − ϵ/2 ≤
Y ≤ y0 + ϵ/2 for some small ϵ. We then have

    P{X ∈ A | Y = y0} ≈ P{X ∈ A | y0 − ϵ/2 ≤ Y ≤ y0 + ϵ/2}
        = P{X ∈ A, y0 − ϵ/2 ≤ Y ≤ y0 + ϵ/2} / P{y0 − ϵ/2 ≤ Y ≤ y0 + ϵ/2}
        = [∫_A ∫_{y0−ϵ/2}^{y0+ϵ/2} f_{X,Y}(x, y) dy dx] / [∫_{y0−ϵ/2}^{y0+ϵ/2} f_Y(y) dy]
        ≈ [∫_A (f_{X,Y}(x, y0) × ϵ) dx] / (f_Y(y0) × ϵ) = ∫_A [f_{X,Y}(x, y0)/f_Y(y0)] dx.
Motivated by the above calculation, we define the conditional density of X given Y = y as

    f_{X|Y=y}(x) := f_{X,Y}(x, y) / f_Y(y).    (67)

This is well-defined as long as f_Y(y) > 0. The result we just derived can also be written as

    P{X ∈ A | Y = y0} = ∫_A f_{X|Y=y0}(x) dx.

Here are important facts about conditional densities.

13.2.1 Conditional Density is Proportional to Joint Density

As a function of x (and keeping y fixed), f_{X|Y=y}(x) is a valid density i.e.,

    f_{X|Y=y}(x) ≥ 0 for every x   and   ∫_{−∞}^∞ f_{X|Y=y}(x) dx = 1.

The integral above equals one because

    ∫_{−∞}^∞ f_{X|Y=y}(x) dx = ∫_{−∞}^∞ [f_{X,Y}(x, y)/f_Y(y)] dx = [∫_{−∞}^∞ f_{X,Y}(x, y) dx]/f_Y(y) = f_Y(y)/f_Y(y) = 1.
Because fX|Y =y (x) integrates to one as a function of x and because the denominator fY (y)
in the definition (67) does not depend on x, it is common to write

fX|Y =y (x) ∝ fX,Y (x, y). (68)

The symbol ∝ here stands for “proportional to” and the above statement means that
fX|Y =y (x), as a function of x, is proportional to fX,Y (x, y). The proportionality constant
then has to be fY (y) because that is equal to the value of the integral of fX,Y (x, y) as x
ranges over (−∞, ∞).

The proportionality statement (68) often makes calculations involving conditional densities
much simpler.

13.2.2 Conditional Densities and Independence

X and Y are independent if and only if fX|Y =y = fX for every value of y such that fY (y) > 0.
This latter statement is precisely equivalent to fX,Y (x, y) = fX (x)fY (y). By switching roles
of X and Y , it also follows that X and Y are independent if and only if fY |X=x = fY for
every x such that fX (x) > 0.

13.2.3 Law of Total Probability for Continuous Random Variables

Note first that from the definition of fX|Y =y (x), it directly follows that

fX,Y (x, y) = fX|Y =y (x)fY (y).

This tells us how to compute the joint density of X and Y using knowledge of the marginal
of Y and the conditional density of X given Y .

From here (and the fact that integrating the joint density with respect to one of the
variables gives the marginal density of the other random variable), it is easy to derive the
formula

    f_X(x) = ∫ f_{X|Y=y}(x) f_Y(y) dy.    (69)

This formula, known as the Law of Total Probability, allows us to deduce the marginal
density of X using knowledge of the conditional density of X given Y and the marginal
density of Y .

Here are two applications of the Law of Total Probability.

Example 13.2. Suppose X and Y are independent standard normal random variables. What
is the density of U = X/Y ?

Using the Law of Total Probability, we get


    f_U(u) = ∫_{−∞}^∞ f_{U|Y=v}(u) f_Y(v) dv = ∫_{−∞}^∞ f_{X/Y | Y=v}(u) f_Y(v) dv.

Now (below =_d stands for equality in distribution: A =_d B means that the random variables A
and B have the same distribution)

    X/Y | Y = v  =_d  X/v | Y = v  =_d  X/v

where the last equality follows because X and Y are independent. We thus get

    f_U(u) = ∫_{−∞}^∞ f_{X/Y | Y=v}(u) f_Y(v) dv = ∫_{−∞}^∞ f_{X/v}(u) f_Y(v) dv.

By the change of variable formula in the univariate case, we get


    f_{X/v}(u) = f_X(uv) |d(uv)/du| = f_X(uv) |v|.
Thus
    f_U(u) = ∫_{−∞}^∞ f_X(uv) |v| f_Y(v) dv
           = ∫_{−∞}^∞ (1/√(2π)) exp(−u²v²/2) (1/√(2π)) exp(−v²/2) |v| dv
           = ∫_{−∞}^∞ (1/(2π)) exp(−(1 + u²)v²/2) |v| dv
           = 2 ∫_0^∞ (1/(2π)) exp(−(1 + u²)v²/2) v dv = 1/(π(1 + u²)).

The last equality is derived by the change of variable w = v²/2 to evaluate the integral. We
have therefore proved that U has the Cauchy density.
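This conclusion is easy to check by simulation (a sketch, not part of the derivation):

    set.seed(1)
    u <- rnorm(1e5) / rnorm(1e5)   # ratio of independent standard normals
    ks.test(u, "pcauchy")          # p-value should typically be large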

The Cauchy density is a special case of the t-density when the degrees of freedom is equal
to one (i.e., the t-density with one degree of freedom is the same as the Cauchy density).
The t-density for n degrees of freedom can also be derived as a consequence of the law of
total probability (this is done in the next example).

Example 13.3. Suppose Z, X1 , . . . , Xn are independent random variables all having the
standard normal distribution. The distribution of the random variable
    T := Z / √((X1² + · · · + Xn²)/n)

is said to be the t-distribution with n degrees of freedom. Its density can be calculated using
the Law of Total Probability as shown below. First let
 
    V := X1² + · · · + Xn²   and note that   V ∼ χ²_n = Gamma(n/2, 1/2).

As a result

    f_T(t) = f_{Z/√(V/n)}(t)
           = ∫_0^∞ f_{Z/√(V/n) | V=v}(t) f_V(v) dv
           = ∫_0^∞ f_{Z/√(v/n)}(t) f_V(v) dv
           = ∫_0^∞ √(v/n) f_Z(t √(v/n)) f_V(v) dv
           = ∫_0^∞ (1/√(2π)) exp(−vt²/(2n)) √(v/n) [(1/2)^{n/2}/Γ(n/2)] v^{(n/2)−1} e^{−v/2} dv
           = (1/√(2πn)) [(1/2)^{n/2}/Γ(n/2)] ∫_0^∞ exp(−(v/2)(1 + t²/n)) v^{(n/2)−(1/2)} dv.

The integrand in the integral above is equal to the main part of the Gamma(α, λ) density
with
    α = (n + 1)/2   and   λ = (1/2)(1 + t²/n).

Thus the value of the integral is simply the normalization constant of the Gamma density:

    ∫_0^∞ exp(−(v/2)(1 + t²/n)) v^{(n/2)−(1/2)} dv = Γ(α)/λ^α = Γ((n+1)/2) 2^{(n+1)/2} (1 + t²/n)^{−(n+1)/2}.

We have thus proved:

    f_T(t) = (1/√(2πn)) [(1/2)^{n/2}/Γ(n/2)] Γ((n+1)/2) 2^{(n+1)/2} (1 + t²/n)^{−(n+1)/2}
           = [Γ((n+1)/2) / (√(πn) Γ(n/2))] (1 + t²/n)^{−(n+1)/2}.

This density, which is proportional to (1 + t²/n)^{−(n+1)/2}, is the t-density with n degrees of
freedom. When n = 1, this density is proportional to (1 + t²)^{−1} so the t-density with 1 degree
of freedom is exactly equal to the Cauchy density. When n becomes large, the tails of the
t-density become less heavy and it eventually becomes the standard normal density. Indeed,
when n is large, we can write (for each fixed t)

    (1 + t²/n)^{−(n+1)/2} ≈ exp(−t² (n + 1)/(2n)) ≈ e^{−t²/2}.

13.2.4 Bayes Rule for Continuous Random Variables

A direct consequence of the definition of the conditional density is:


    f_{Y|X=x}(y) = f_{X|Y=y}(x) f_Y(y) / f_X(x) = f_{X|Y=y}(x) f_Y(y) / ∫ f_{X|Y=u}(x) f_Y(u) du.

This is the Bayes rule and it is useful for calculating the conditional density of Y given X = x
from knowledge of the conditional density of X given Y = y (and the marginal density of
Y ). We shall see many applications of this rule in the next few lectures.

14 Lecture Fourteen

14.1 Recap: Last Class

In the last class, we looked at conditional densities for continuous random variables. Given
two continuous random variables X and Y having a joint density fX,Y (x, y), the conditional
density of X given Y = y is defined as
fX,Y (x, y)
fX|Y =y (x) := . (70)
fY (y)

This definition makes sense when fY (y) > 0. Also when fY (y) > 0, the above is a valid
density in x i.e., fX|Y =y (x) ≥ 0 for all x and
    ∫_{−∞}^∞ f_{X|Y=y}(x) dx = 1.

Using the conditional density, the conditional probabilities involving X given Y = y are
calculated as:

    P{X ∈ A | Y = y} := ∫_A f_{X|Y=y}(x) dx.

14.2 Law of Total Probability (LTP) and Bayes Rule for Continuous Variables

Conditional densities are used all over the place in probability and statistics. They are
particulary useful while making probability assignments in Bayesian analyses. Here it is
convenient to denote the two random variables by Θ and X (as opposed to X and Y ).
Θ typically denotes an unobserved parameter while X denotes observed data. A Bayesian
analysis starts by making an assignment for the probability distribution of Θ and X. Directly
modeling the joint density is usually difficult. One therefore models the marginal distribution

of Θ (this is called the prior distribution) and the conditional distribution of X given Θ = θ
(this is called the likelihood):

fΘ (θ) and fX|Θ=θ (x).

From these, the joint density can be written as

    f_{Θ,X}(θ, x) = f_Θ(θ) f_{X|Θ=θ}(x).

This completely specifies the joint probability distribution of Θ and X. Unspecified quantities
such as the marginal distribution of X and the conditional distribution of Θ given X = x
are then calculated by using the rules of probability.

For calculating the marginal distribution of X, we use the law of total probability:
    f_X(x) = ∫ f_{X|Θ=θ}(x) f_Θ(θ) dθ.

For calculating the conditional distribution of Θ given X = x, we use the Bayes rule:
    f_{Θ|X=x}(θ) = f_{X|Θ=θ}(x) f_Θ(θ) / f_X(x)
                 = f_{X|Θ=θ}(x) f_Θ(θ) / ∫ f_{X|Θ=θ}(x) f_Θ(θ) dθ
                 ∝ f_{X|Θ=θ}(x) f_Θ(θ).

In the statistical context, we shall refer to fΘ|X=x (·) as the posterior density of Θ and fX (·)
as the Evidence.

Here is a simple application of these formulae. More interesting examples will be studied
later.

Example 14.1. Suppose Θ ∼ N (µ, τ 2 ) and X|Θ = θ ∼ N (θ, σ 2 ). Then


    X ∼ N(µ, τ² + σ²)   and   Θ|X = x ∼ N( (x/σ² + µ/τ²)/(1/σ² + 1/τ²) , 1/(1/σ² + 1/τ²) ).    (71)
I will sketch the proof of the above results below. The intuition behind the posterior distri-
bution is as follows. For a normal density with mean m and variance v 2 , the inverse of the
variance 1/v 2 is called the precision. Skinnier normal distributions have high precision and
vice versa.

The formula for the posterior distribution given above implies that the precision of the
conditional distribution of Θ given X = x equals the sum of the precisions of the distribu-
tion of Θ and the distribution of X respectively which means that the posterior is skinnier
compared to the prior and the likelihood normal distributions. Also the mean of the posterior
distribution equals a weighted linear combination of the prior mean and the data with the
weights being proportional to the precisions.

To derive the first part of (71), we use the LTP:

    f_X(x) = ∫ f_{X|Θ=θ}(x) f_Θ(θ) dθ.

Now

    f_{X|Θ=θ}(x) f_Θ(θ) = (1/(2πτσ)) exp(−(1/2)[(θ − µ)²/τ² + (x − θ)²/σ²]).

The term in the exponent above can be simplified as

    (θ − µ)²/τ² + (x − θ)²/σ² = (1/τ² + 1/σ²) θ² − 2θ (x/σ² + µ/τ²) + µ²/τ² + x²/σ²
        = (1/τ² + 1/σ²) [θ − (x/σ² + µ/τ²)/(1/σ² + 1/τ²)]² + (x − µ)²/(τ² + σ²)

where I skipped a few steps to get to the last equality (complete the square and simplify the
resulting expressions).

As a result

    f_{X|Θ=θ}(x) f_Θ(θ) = (1/(2πτσ)) exp(−(1/2)(1/τ² + 1/σ²)[θ − (x/σ² + µ/τ²)/(1/σ² + 1/τ²)]²) exp(−(x − µ)²/(2(τ² + σ²))).

Consequently,

    f_X(x) = (1/(2πτσ)) exp(−(x − µ)²/(2(τ² + σ²))) ∫ exp(−(1/2)(1/τ² + 1/σ²)[θ − (x/σ² + µ/τ²)/(1/σ² + 1/τ²)]²) dθ
           = (1/(2πτσ)) exp(−(x − µ)²/(2(τ² + σ²))) √(2π) (1/τ² + 1/σ²)^{−1/2}
           = (1/√(2π(τ² + σ²))) exp(−(x − µ)²/(2(τ² + σ²)))

which gives

    X ∼ N(µ, τ² + σ²).

For the posterior distribution in (71), we use the Bayes rule (and the above derived expres-
sions for f_{X|Θ=θ}(x) f_Θ(θ) and f_X(x)):

    f_{Θ|X=x}(θ) = f_{X|Θ=θ}(x) f_Θ(θ) / f_X(x)
                 = (√(τ² + σ²)/(√(2π) τσ)) exp(−(1/2)(1/τ² + 1/σ²)[θ − (x/σ² + µ/τ²)/(1/σ² + 1/τ²)]²)

which immediately implies:

    Θ|X = x ∼ N( (x/σ² + µ/τ²)/(1/σ² + 1/τ²) , 1/(1/σ² + 1/τ²) ).
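The closed-form posterior can be checked against a brute-force grid computation in R (the values of µ, τ, σ and x below are illustrative):

    mu <- 1; tau <- 2; sigma <- 0.5; x <- 3
    theta <- seq(-10, 10, by = 1e-4)
    post <- dnorm(x, mean = theta, sd = sigma) * dnorm(theta, mean = mu, sd = tau)
    post <- post / sum(post * 1e-4)                          # numerical normalization
    sum(theta * post * 1e-4)                                 # numerical posterior mean
    (x / sigma^2 + mu / tau^2) / (1 / sigma^2 + 1 / tau^2)   # formula; should agree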

14.3 LTP and Bayes Rule for general random variables

The LTP describes how to compute the distribution of X based on knowledge of the condi-
tional distribution of X given Θ = θ as well as the marginal distribution of Θ. The Bayes
rule describes how to compute the conditional distribution of Θ given X = x based on the
same knowledge of the conditional distribution of X given Θ = θ as well as the marginal
distribution of Θ.

We have so far looked at the LTP and Bayes rule when X and Θ are both discrete or when
they are both continuous. Now we shall also consider the cases when one of them is discrete
and the other is continuous.

14.3.1 X and Θ are both discrete

The LTP is

    P{X = x} = Σ_θ P{X = x|Θ = θ} P{Θ = θ}

and the Bayes rule is

    P{Θ = θ|X = x} = P{X = x|Θ = θ} P{Θ = θ} / P{X = x} = P{X = x|Θ = θ} P{Θ = θ} / Σ_θ P{X = x|Θ = θ} P{Θ = θ}.

14.3.2 X and Θ are both continuous

Here LTP is

    f_X(x) = ∫ f_{X|Θ=θ}(x) f_Θ(θ) dθ

and Bayes rule is

    f_{Θ|X=x}(θ) = f_{X|Θ=θ}(x) f_Θ(θ) / f_X(x) = f_{X|Θ=θ}(x) f_Θ(θ) / ∫ f_{X|Θ=θ}(x) f_Θ(θ) dθ.

14.3.3 X is discrete while Θ is continuous

LTP is

    P{X = x} = ∫ P{X = x|Θ = θ} f_Θ(θ) dθ

and Bayes rule is

    f_{Θ|X=x}(θ) = P{X = x|Θ = θ} f_Θ(θ) / P{X = x} = P{X = x|Θ = θ} f_Θ(θ) / ∫ P{X = x|Θ = θ} f_Θ(θ) dθ.

14.3.4 X is continuous while Θ is discrete

LTP is

    f_X(x) = Σ_θ f_{X|Θ=θ}(x) P{Θ = θ}

and Bayes rule is

    P{Θ = θ|X = x} = f_{X|Θ=θ}(x) P{Θ = θ} / f_X(x) = f_{X|Θ=θ}(x) P{Θ = θ} / Σ_θ f_{X|Θ=θ}(x) P{Θ = θ}

These formulae are useful when the conditional distribution of X given Θ = θ as well as
the marginal distribution of Θ are given as part of the model specification and the goal is to
determine the marginal distribution of X as well as the conditional distribution of Θ given
X = x.

The following is an example of the LTP and Bayes Rule when Θ is continuous and X is
discrete.

Example 14.2. Suppose that Θ has the Beta(α, β) distribution on (0, 1) and let X|Θ = θ
have the binomial distribution with parameters n and θ. What then is the marginal distribution
of X as well as the conditional distribution of Θ given X = x?

Note that this is a situation where X is discrete (taking values in 0, 1, . . . , n) and Θ is
continuous (taking values in the interval (0, 1)). To compute the marginal distribution of X,
we use the appropriate LTP to write (for x = 0, 1, . . . , n)
    P{X = x} = ∫ P{X = x|Θ = θ} f_Θ(θ) dθ
             = ∫_0^1 (n choose x) θ^x (1 − θ)^{n−x} [θ^{α−1} (1 − θ)^{β−1} / B(α, β)] dθ
             = (n choose x) B(x + α, n − x + β) / B(α, β).
Let us now calculate the posterior distribution of Θ given X = x. Using the Bayes rule, we
obtain
    f_{Θ|X=x}(θ) ∝ P{X = x|Θ = θ} f_Θ(θ) ∝ θ^x (1 − θ)^{n−x} θ^{α−1} (1 − θ)^{β−1} I{0 < θ < 1}
                 = θ^{x+α−1} (1 − θ)^{n−x+β−1} I{0 < θ < 1}.
Thus
Θ|X = x ∼ Beta(x + α, n − x + β).
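These formulae are straightforward to evaluate in R; the following sketch uses the illustrative values α = 2, β = 3, n = 10, x = 7:

    alpha <- 2; beta_ <- 3; n <- 10; xx <- 7
    # marginal pmf P(X = x) from the LTP computation above
    choose(n, xx) * beta(xx + alpha, n - xx + beta_) / beta(alpha, beta_)
    # posterior density of Theta given X = x is Beta(x + alpha, n - x + beta)
    curve(dbeta(t, xx + alpha, n - xx + beta_), from = 0, to = 1, xname = "t")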

15 Lecture Fifteen

15.1 LTP and Bayes Rule for general random variables

The LTP describes how to compute the distribution of X based on knowledge of the condi-
tional distribution of X given Θ = θ as well as the marginal distribution of Θ. The Bayes
rule describes how to compute the conditional distribution of Θ given X = x based on the
same knowledge of the conditional distribution of X given Θ = θ as well as the marginal
distribution of Θ.

The precise formulae for the LTP and Bayes Rule can be broken down into four cases
according to whether X and/or Θ is discrete or continuous.

15.1.1 X and Θ are both discrete

The LTP is

    P{X = x} = Σ_θ P{X = x|Θ = θ} P{Θ = θ}

and the Bayes rule is

    P{Θ = θ|X = x} = P{X = x|Θ = θ} P{Θ = θ} / P{X = x} = P{X = x|Θ = θ} P{Θ = θ} / Σ_θ P{X = x|Θ = θ} P{Θ = θ}.

15.1.2 X and Θ are both continuous

Here LTP is

    f_X(x) = ∫ f_{X|Θ=θ}(x) f_Θ(θ) dθ    (72)

and Bayes rule is

    f_{Θ|X=x}(θ) = f_{X|Θ=θ}(x) f_Θ(θ) / f_X(x) = f_{X|Θ=θ}(x) f_Θ(θ) / ∫ f_{X|Θ=θ}(x) f_Θ(θ) dθ.

15.1.3 X is discrete while Θ is continuous

LTP is

    P{X = x} = ∫ P{X = x|Θ = θ} f_Θ(θ) dθ

and Bayes rule is

    f_{Θ|X=x}(θ) = P{X = x|Θ = θ} f_Θ(θ) / P{X = x} = P{X = x|Θ = θ} f_Θ(θ) / ∫ P{X = x|Θ = θ} f_Θ(θ) dθ.

15.1.4 X is continuous while Θ is discrete

LTP is

    f_X(x) = Σ_θ f_{X|Θ=θ}(x) P{Θ = θ}    (73)

and Bayes rule is

    P{Θ = θ|X = x} = f_{X|Θ=θ}(x) P{Θ = θ} / f_X(x) = f_{X|Θ=θ}(x) P{Θ = θ} / Σ_θ f_{X|Θ=θ}(x) P{Θ = θ}    (74)

These formulae are useful when the conditional distribution of X given Θ = θ as well as
the marginal distribution of Θ are given as part of the model specification and the goal is to
determine the marginal distribution of X as well as the conditional distribution of Θ given
X = x.

We shall look at some more applications of these formulae today.

15.2 A Simple Model Selection Application

Suppose Θ has the Ber(0.5) distribution i.e.,

P{Θ = 0} = P{Θ = 1} = 0.5.

Next assume that X1 , . . . , Xn have the following distributions conditional on Θ = θ:


    X1, . . . , Xn | Θ = 0  i.i.d. ∼ f0   where f0(x) := (1/√(2π)) exp(−x²/2)

and

    X1, . . . , Xn | Θ = 1  i.i.d. ∼ f1   where f1(x) := (1/√(2π)) exp(−|x| √(2/π)).
f0 is the standard normal density and f1 is a Laplace density. Both densities have the
same maximal value of 1/√(2π). Based on the information given, calculate the conditional
distribution of Θ given X1 = x1 , X2 = x2 , . . . , X6 = x6 (i.e., n = 6) where

x1 = −0.55, x2 = −1.11, x3 = 1.23, x4 = 0.29, x5 = 1.56, x6 = −1.64. (75)

Here is the context for this question. We observe data x1 , . . . , xn with n = 6. We want to
use one of the models f0 or f1 for this data. The random variable Θ is used to describe the
choice of the model. We want to treat both the models on an equal footing so we assumed
that Θ has the uniform prior distribution on {0, 1}.

To calculate the conditional distribution of Θ given the data, we use the formula (74)
because Θ is discrete and the data X1 , . . . , Xn are continuous. This gives

P{Θ = 0 | X1 = x1, . . . , Xn = xn}
    = f_{X1,...,Xn|Θ=0}(x1, . . . , xn) P{Θ = 0} / [f_{X1,...,Xn|Θ=0}(x1, . . . , xn) P{Θ = 0} + f_{X1,...,Xn|Θ=1}(x1, . . . , xn) P{Θ = 1}]
    = f_{X1,...,Xn|Θ=0}(x1, . . . , xn) / [f_{X1,...,Xn|Θ=0}(x1, . . . , xn) + f_{X1,...,Xn|Θ=1}(x1, . . . , xn)]
    = f0(x1) · · · f0(xn) / [f0(x1) · · · f0(xn) + f1(x1) · · · f1(xn)]
    = (1/√(2π))^n exp(−(1/2) Σ_{i=1}^n x_i²) / [(1/√(2π))^n exp(−(1/2) Σ_{i=1}^n x_i²) + (1/√(2π))^n exp(−√(2/π) Σ_{i=1}^n |x_i|)].

Similarly

P{Θ = 1 | X1 = x1, . . . , Xn = xn}
    = (1/√(2π))^n exp(−√(2/π) Σ_{i=1}^n |x_i|) / [(1/√(2π))^n exp(−(1/2) Σ_{i=1}^n x_i²) + (1/√(2π))^n exp(−√(2/π) Σ_{i=1}^n |x_i|)].

Plugging the data values given in (75) for x1, . . . , x6 into the above formulae, we obtain

P{Θ = 0 | X1 = x1 , . . . , X6 = x6 } = 0.72 and P{Θ = 1 | X1 = x1 , . . . , X6 = x6 } = 0.28

Thus, conditioning on the data, we have a 72% probability for the normal model compared
to 28% probability for the Laplace model. Now suppose that we add in an additional
observation x7 = 5. It can be checked that

P{Θ = 0 | X1 = x1 , . . . , X7 = x7 } = 0.001 and P{Θ = 1 | X1 = x1 , . . . , X7 = x7 } = 0.999

Now there is overwhelming preference for the Laplace model. This is because x7 = 5 is an
outlying observation to which the Laplace model gives much higher probability compared to
the Normal model owing to heavy tails of the Laplace density.
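The posterior probabilities quoted above can be reproduced with a few lines of R (a sketch; working on the log scale avoids numerical underflow):

    x <- c(-0.55, -1.11, 1.23, 0.29, 1.56, -1.64)              # the data in (75)
    log_f0 <- sum(dnorm(x, log = TRUE))                        # log f0(x1)...f0(xn)
    log_f1 <- sum(-0.5 * log(2 * pi) - abs(x) * sqrt(2 / pi))  # log f1(x1)...f1(xn)
    p0 <- 1 / (1 + exp(log_f1 - log_f0))
    c(p0, 1 - p0)   # approximately (0.72, 0.28)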

15.3 Model Selection with unknown parameters

Suppose Θ has the Ber(0.5) distribution i.e.,

P{Θ = 0} = P{Θ = 1} = 0.5.

Next assume that X1 , . . . , Xn have the following distributions conditional on Θ = θ:


    X1, . . . , Xn | Θ = 0  i.i.d. ∼ N(0, σ0²)   for some σ0 > 0

and

    X1, . . . , Xn | Θ = 1  i.i.d. ∼ Lap(0, σ1)   for some σ1 > 0.

Here Lap(0, σ1) denotes the Laplace density centered at 0 and having scale σ1; its density is
given by

    x ↦ (1/(2σ1)) exp(−|x|/σ1).
Based on this information, calculate the conditional distribution of Θ given X1 = x1 , . . . , Xn =
xn where n = 10 and x1 , . . . , xn are given by

−0.69, −4.26, 0.14, −0.86, 0.42, 24.21, 0.51, −1.23, 2.30, 4.15. (76)

Once again, this is a model selection problem where we need to choose between the normal
model and the Laplace model based on the observed data given above. We can proceed
exactly as in the last section and write down the conditional probabilities of Θ given X1 =
x1 , . . . , Xn = xn . However the answers would depend on σ0 and σ1 . We would not be able
to make a decision between the two models because of this annoying dependence on σ0 , σ1 .
To get rid of the dependence on the specific values of σ0 , σ1 , a natural strategy is to treat
σ0 and σ1 as unknown parameters and further make distributional assumptions on them to
reflect our ignorance of their precise values. One way of doing this is to assume that:
    log σ0 | Θ = 0 ∼ Unif(−C, C)   and   X1, . . . , Xn | σ0, Θ = 0  i.i.d. ∼ N(0, σ0²)    (77)

as well as

    log σ1 | Θ = 1 ∼ Unif(−C, C)   and   X1, . . . , Xn | σ1, Θ = 1  i.i.d. ∼ Lap(0, σ1)    (78)

for a large constant value C. In other words, we are using the uniform distribution on
(−C, C) to reflect our ignorance of log σ0 and log σ1 . We can now calculate the conditional
distribution of Θ given X1 = x1 , . . . , Xn = xn in the following way. As in the previous
section, we first obtain

P{Θ = 0 | X1 = x1, . . . , Xn = xn}
    = f_{X1,...,Xn|Θ=0}(x1, . . . , xn) P{Θ = 0} / [f_{X1,...,Xn|Θ=0}(x1, . . . , xn) P{Θ = 0} + f_{X1,...,Xn|Θ=1}(x1, . . . , xn) P{Θ = 1}]
    = f_{X1,...,Xn|Θ=0}(x1, . . . , xn) / [f_{X1,...,Xn|Θ=0}(x1, . . . , xn) + f_{X1,...,Xn|Θ=1}(x1, . . . , xn)]    (79)

and similarly

P{Θ = 1 | X1 = x1, . . . , Xn = xn} = f_{X1,...,Xn|Θ=1}(x1, . . . , xn) / [f_{X1,...,Xn|Θ=0}(x1, . . . , xn) + f_{X1,...,Xn|Θ=1}(x1, . . . , xn)].    (80)

We therefore need to calculate

fX1 ,...,Xn |Θ=0 (x1 , . . . , xn ) and fX1 ,...,Xn |Θ=1 (x1 , . . . , xn )

These densities are not directly given to us (unlike in the problem of the previous section)
but they are given conditionally on the parameters σ0 and σ1 . We shall therefore calculate

them using the Law of Total Probability (72) (note that X1, . . . , Xn as well as σ0, σ1 are all
continuous random variables). We thus have
    f_{X1,...,Xn|Θ=0}(x1, . . . , xn) = ∫ f_{X1,...,Xn|Θ=0,σ0}(x1, . . . , xn) f_{σ0|Θ=0}(σ0) dσ0
        = ∫ [∏_{i=1}^n f_{Xi|Θ=0,σ0}(x_i)] f_{σ0|Θ=0}(σ0) dσ0
        = ∫ [∏_{i=1}^n (1/(√(2π)σ0)) exp(−x_i²/(2σ0²))] f_{σ0|Θ=0}(σ0) dσ0
        = (1/√(2π))^n ∫ σ0^{−n} exp(−Σ_{i=1}^n x_i²/(2σ0²)) f_{σ0|Θ=0}(σ0) dσ0.

Because we assumed that log σ0 has the uniform distribution on (−C, C) (conditional on
Θ = 0), we get

    f_{σ0|Θ=0}(σ0) = f_{log σ0|Θ=0}(log σ0) |d(log σ0)/dσ0| = f_{log σ0|Θ=0}(log σ0) (1/σ0) = I{−C < log σ0 < C}/(2Cσ0).

As a result

    f_{X1,...,Xn|Θ=0}(x1, . . . , xn) = (1/√(2π))^n ∫ σ0^{−n} exp(−Σ_{i=1}^n x_i²/(2σ0²)) [I{−C < log σ0 < C}/(2Cσ0)] dσ0
        = (1/√(2π))^n (1/(2C)) ∫_{e^{−C}}^{e^C} σ0^{−n−1} exp(−Σ_{i=1}^n x_i²/(2σ0²)) dσ0.

Because C is large, the limits in the above integral will effectively be between 0 and ∞ (as
e^{−C} ≈ 0 and e^C ≈ ∞). Thus

    f_{X1,...,Xn|Θ=0}(x1, . . . , xn) ≈ (1/√(2π))^n (1/(2C)) ∫_0^∞ σ0^{−n−1} exp(−Σ_{i=1}^n x_i²/(2σ0²)) dσ0.

To evaluate the integral above, we use the change of variable


    u = Σ_{i=1}^n x_i²/(2σ0²)   so that   σ0 = (Σ_{i=1}^n x_i²/2)^{1/2} u^{−1/2}   and   dσ0 = −(Σ_{i=1}^n x_i²/2)^{1/2} (1/2) u^{−3/2} du

which gives

    f_{X1,...,Xn|Θ=0}(x1, . . . , xn)
        = (1/√(2π))^n (1/(2C)) ∫_0^∞ (Σ_{i=1}^n x_i²/2)^{−(n+1)/2} u^{(n+1)/2} e^{−u} (Σ_{i=1}^n x_i²/2)^{1/2} (1/2) u^{−3/2} du
        = (1/√(2π))^n (1/(2C)) (1/2) (Σ_{i=1}^n x_i²/2)^{−n/2} ∫_0^∞ u^{(n/2)−1} e^{−u} du
        = (1/√(2π))^n (1/(2C)) (Σ_{i=1}^n x_i²/2)^{−n/2} (1/2) Γ(n/2).

Similarly
\[
\begin{aligned}
f_{X_1,\dots,X_n\mid\Theta=1}(x_1,\dots,x_n)
&= \int f_{X_1,\dots,X_n\mid\Theta=1,\sigma_1}(x_1,\dots,x_n)\, f_{\sigma_1\mid\Theta=1}(\sigma_1)\, d\sigma_1 \\
&= \int \left[\prod_{i=1}^n f_{X_i\mid\Theta=1,\sigma_1}(x_i)\right] f_{\sigma_1\mid\Theta=1}(\sigma_1)\, d\sigma_1 \\
&= \int \left[\prod_{i=1}^n \frac{1}{2\sigma_1}\exp\left(-\frac{|x_i|}{\sigma_1}\right)\right] \frac{I\{-C<\log\sigma_1<C\}}{2C\sigma_1}\, d\sigma_1 \\
&= \left(\frac{1}{2}\right)^n \frac{1}{2C}\int_{e^{-C}}^{e^{C}} \sigma_1^{-n-1}\exp\left(-\frac{\sum_{i=1}^n |x_i|}{\sigma_1}\right) d\sigma_1 \\
&\approx \left(\frac{1}{2}\right)^n \frac{1}{2C}\int_0^{\infty} \sigma_1^{-n-1}\exp\left(-\frac{\sum_{i=1}^n |x_i|}{\sigma_1}\right) d\sigma_1.
\end{aligned}
\]

The change of variable


\[
u = \frac{\sum_{i=1}^n |x_i|}{\sigma_1}
\]
leads to

\[
f_{X_1,\dots,X_n\mid\Theta=1}(x_1,\dots,x_n)
= \left(\frac{1}{2}\right)^n \frac{1}{2C}\left(\sum_{i=1}^n |x_i|\right)^{-n}\int_0^{\infty} u^{n-1} e^{-u}\, du
= \left(\frac{1}{2}\right)^n \frac{1}{2C}\left(\sum_{i=1}^n |x_i|\right)^{-n}\Gamma(n).
\]

Plugging these expressions in (79) and (80), we obtain

\[
\begin{aligned}
P\{\Theta=0 \mid X_1=x_1,\dots,X_n=x_n\}
&= \frac{\left(\frac{1}{\sqrt{2\pi}}\right)^n \frac{1}{2C}\left(\frac{\sum_{i=1}^n x_i^2}{2}\right)^{-n/2}\frac{1}{2}\Gamma\!\left(\frac{n}{2}\right)}{\left(\frac{1}{\sqrt{2\pi}}\right)^n \frac{1}{2C}\left(\frac{\sum_{i=1}^n x_i^2}{2}\right)^{-n/2}\frac{1}{2}\Gamma\!\left(\frac{n}{2}\right) + \left(\frac{1}{2}\right)^n \frac{1}{2C}\left(\sum_{i=1}^n |x_i|\right)^{-n}\Gamma(n)} \\
&= \frac{\left(\frac{1}{\sqrt{2\pi}}\right)^n \left(\frac{\sum_{i=1}^n x_i^2}{2}\right)^{-n/2}\frac{1}{2}\Gamma\!\left(\frac{n}{2}\right)}{\left(\frac{1}{\sqrt{2\pi}}\right)^n \left(\frac{\sum_{i=1}^n x_i^2}{2}\right)^{-n/2}\frac{1}{2}\Gamma\!\left(\frac{n}{2}\right) + \left(\frac{1}{2}\right)^n \left(\sum_{i=1}^n |x_i|\right)^{-n}\Gamma(n)}
\end{aligned}
\]

(the common factor $\frac{1}{2C}$ cancels),

and

\[
P\{\Theta=1 \mid X_1=x_1,\dots,X_n=x_n\}
= \frac{\left(\frac{1}{2}\right)^n \left(\sum_{i=1}^n |x_i|\right)^{-n}\Gamma(n)}{\left(\frac{1}{\sqrt{2\pi}}\right)^n \left(\frac{\sum_{i=1}^n x_i^2}{2}\right)^{-n/2}\frac{1}{2}\Gamma\!\left(\frac{n}{2}\right) + \left(\frac{1}{2}\right)^n \left(\sum_{i=1}^n |x_i|\right)^{-n}\Gamma(n)}.
\]
Now the observed data values in (76) can be plugged in to compute the posterior probabilities.
This gives

\[
P\{\Theta=0 \mid X_1=x_1,\dots,X_n=x_n\} = 0.008 \quad\text{and}\quad P\{\Theta=1 \mid X_1=x_1,\dots,X_n=x_n\} = 0.992.
\]

Thus the observed data in (76) (which seems to contain some outliers) overwhelmingly favors
the Laplace model compared to the Normal model.

15.3.1 Considering one more model

Suppose now that Θ has the distribution given by:


\[
P\{\Theta=0\} = P\{\Theta=1\} = P\{\Theta=2\} = \frac{1}{3}
\]

and that $X_1,\dots,X_n$ have the following distributions conditional on $\Theta=\theta$:

\[
X_1,\dots,X_n \mid \Theta=0 \overset{\text{i.i.d.}}{\sim} N(0,\sigma_0^2) \ \text{for some } \sigma_0>0,
\]
\[
X_1,\dots,X_n \mid \Theta=1 \overset{\text{i.i.d.}}{\sim} \mathrm{Lap}(0,\sigma_1) \ \text{for some } \sigma_1>0,
\]
\[
X_1,\dots,X_n \mid \Theta=2 \overset{\text{i.i.d.}}{\sim} C(0,\sigma_2) \ \text{for some } \sigma_2>0.
\]
Here $C(0,\sigma_2)$ denotes the Cauchy distribution with location parameter 0 and scale parameter $\sigma_2$, whose density is $\frac{1}{\pi}\frac{\sigma_2}{x^2+\sigma_2^2}$. What then is the conditional distribution of $\Theta$ given $X_1=x_1,\dots,X_n=x_n$ for the same data (76)?

This is basically the same problem as that considered in the previous section except that
we are considering the Cauchy model in addition to the normal and the Laplace models.

Using the Bayes rule, it is easy to see that

\[
P\{\Theta=\theta \mid X_1=x_1,\dots,X_n=x_n\}
= \frac{f_{X_1,\dots,X_n\mid\Theta=\theta}(x_1,\dots,x_n)}{f_{X_1,\dots,X_n\mid\Theta=0}(x_1,\dots,x_n) + f_{X_1,\dots,X_n\mid\Theta=1}(x_1,\dots,x_n) + f_{X_1,\dots,X_n\mid\Theta=2}(x_1,\dots,x_n)}
\]

for each θ = 0, 1, 2. The calculation for fX1 ,...,Xn |Θ=θ (x1 , . . . , xn ) for θ = 0 and θ = 1 is done
based on the assumptions (77) and (78), and this leads to exactly the same values as in the
previous section. More specifically
\[
f_{X_1,\dots,X_n\mid\Theta=0}(x_1,\dots,x_n) = \left(\frac{1}{\sqrt{2\pi}}\right)^n \frac{1}{2C}\left(\frac{\sum_{i=1}^n x_i^2}{2}\right)^{-n/2}\frac{1}{2}\,\Gamma\!\left(\frac{n}{2}\right)
\]

and

\[
f_{X_1,\dots,X_n\mid\Theta=1}(x_1,\dots,x_n) = \left(\frac{1}{2}\right)^n \frac{1}{2C}\left(\sum_{i=1}^n |x_i|\right)^{-n}\Gamma(n).
\]

For fX1 ,...,Xn |Θ=2 (x1 , . . . , xn ), we shall make the assumption (analogous to (77) and (78)):
\[
\log\sigma_2 \mid \Theta=2 \sim \mathrm{Unif}(-C,C) \quad\text{and}\quad X_1,\dots,X_n \mid \sigma_2,\Theta=2 \overset{\text{i.i.d.}}{\sim} C(0,\sigma_2).
\]

This leads to
" n  #
Z eC Y 1σ2 1
fX1 ,...,Xn |Θ=2 (x1 , . . . , xn ) = 2 2 dσ2
e−C i=1 π xi + σ2 2Cσ2
Z ∞ "Yn  #
1 σ2 1
= 2 + σ2 dσ2 .
0 π x i 2 2Cσ 2
i=1

It is probably difficult to calculate this integral in closed form. But it is quite straightforward
to compute this numerically (in the code file, I computed this after the change of variable
t = log σ2 ).

For the dataset in (76), the above analysis leads to

\[
P\{\Theta=\theta \mid X_1=x_1,\dots,X_n=x_n\} =
\begin{cases}
0.0003 & \text{for } \theta=0 \\
0.0417 & \text{for } \theta=1 \\
0.958 & \text{for } \theta=2.
\end{cases}
\]

Thus the Cauchy model has the highest probability. The Laplace model which received the
highest probability when only the two models (Normal and Laplace) were being considered
now gets low probability when we are also considering the Cauchy model.
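A hypothetical stand-in for the code file mentioned above is sketched below in Python: the normal and Laplace evidences are computed in closed form, while the Cauchy evidence is integrated numerically on a grid after the change of variable $t=\log\sigma_2$ (the grid limits and resolution are arbitrary choices, and the common factor $1/(2C)$ is dropped from all three evidences since it cancels).

import numpy as np
from scipy.special import gammaln, logsumexp

x = np.array([-0.69, -4.26, 0.14, -0.86, 0.42, 24.21, 0.51, -1.23, 2.30, 4.15])
n = len(x)

log_ev = np.empty(3)
# Closed-form log-evidences for the normal (theta = 0) and Laplace (theta = 1) models.
log_ev[0] = (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(np.sum(x ** 2) / 2)
             + np.log(0.5) + gammaln(n / 2))
log_ev[1] = -n * np.log(2) - n * np.log(np.sum(np.abs(x))) + gammaln(n)

# Cauchy (theta = 2): integrate over t = log(sigma_2) on a wide grid.
t = np.linspace(-10, 10, 4001)
sigma = np.exp(t)
log_integrand = np.sum(
    -np.log(np.pi) + np.log(sigma)[None, :] - np.log(x[:, None] ** 2 + sigma[None, :] ** 2),
    axis=0)
log_ev[2] = logsumexp(log_integrand) + np.log(t[1] - t[0])  # Riemann sum on the log scale

print(np.exp(log_ev - logsumexp(log_ev)))  # approximately [0.0003, 0.0417, 0.958]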

This approach for Model Selection is often known as Bayesian Model Selection. An impor-
tant role in this approach is played by the quantities fX1 ,...,Xn |Θ=θ (x1 , . . . , xn ) for different
values of θ. In the Machine Learning literature, this quantity is known as the Evidence
for the model represented by $\Theta=\theta$ in light of the observed data $x_1,\dots,x_n$. When each individual model contains additional parameters (as in the present case, where the $i$th model is expressed in terms of the parameter $\sigma_i$), the Evidence is calculated (via the Law of Total Probability) by integrating the likelihood over those parameters with respect to their prior. For this reason, the Evidence for a model is also known as the Integrated Likelihood of the model.

For more on Bayesian model selection using Evidences, see Jaynes [1, Chapter 20] or
MacKay [5, Chapter 28].

16 Lecture Sixteen

16.1 Conditional Expectation

Given two random variables $X$ and $Y$, the conditional expectation (or conditional mean) of $Y$ given $X=x$ is denoted by

\[
E(Y \mid X=x)
\]

and is defined as the expectation of the conditional distribution of $Y$ given $X=x$.

We can write

\[
E(Y\mid X=x) =
\begin{cases}
\int y\, f_{Y\mid X=x}(y)\, dy & \text{if } Y \text{ is continuous} \\
\sum_y y\, P\{Y=y\mid X=x\} & \text{if } Y \text{ is discrete.}
\end{cases}
\]

More generally,

\[
E(g(Y)\mid X=x) =
\begin{cases}
\int g(y)\, f_{Y\mid X=x}(y)\, dy & \text{if } Y \text{ is continuous} \\
\sum_y g(y)\, P\{Y=y\mid X=x\} & \text{if } Y \text{ is discrete,}
\end{cases}
\]

and also

\[
E(g(X,Y)\mid X=x) = E(g(x,Y)\mid X=x) =
\begin{cases}
\int g(x,y)\, f_{Y\mid X=x}(y)\, dy & \text{if } Y \text{ is continuous} \\
\sum_y g(x,y)\, P\{Y=y\mid X=x\} & \text{if } Y \text{ is discrete.}
\end{cases}
\]

The most important fact about conditional expectation is the Law of Iterated Expecta-
tion (also known as the Law of Total Expectation). We shall see this next.

16.1.1 Law of Iterated/Total Expectation

The law of total expectation states that

\[
E(Y) =
\begin{cases}
\int E(Y\mid X=x)\, f_X(x)\, dx & \text{if } X \text{ is continuous} \\
\sum_x E(Y\mid X=x)\, P\{X=x\} & \text{if } X \text{ is discrete.}
\end{cases}
\]

Basically, the law of total expectation tells us how to compute $E(Y)$ using knowledge of the conditional expectation of $Y$ given $X=x$. Note the similarity to the law of total probability, which specifies how to compute the marginal distribution of $Y$ using knowledge of the conditional distribution of $Y$ given $X=x$.

The law of total expectation can be proved as a consequence of the law of total probability.
The proof when Y and X are continuous is given below. The proof in other cases (when one
or both of Y and X are discrete) is similar and left as an exercise.

Proof of the law of total expectation: Assume that $Y$ and $X$ are both continuous. Then

\[
E(Y) = \int y\, f_Y(y)\, dy.
\]

By the law of total probability, $f_Y(y) = \int f_{Y\mid X=x}(y)\, f_X(x)\, dx$, so

\[
E(Y) = \int y\left[\int f_{Y\mid X=x}(y)\, f_X(x)\, dx\right] dy
= \int \left[\int y\, f_{Y\mid X=x}(y)\, dy\right] f_X(x)\, dx
= \int E(Y\mid X=x)\, f_X(x)\, dx,
\]

which proves the law of total expectation.

There is an alternate, more succinct way of stating the law of total expectation, which justifies calling it the law of iterated expectation. We shall see this next. Note that $E(Y\mid X=x)$ depends on $x$; in other words, $E(Y\mid X=x)$ is a function of $x$. Let us denote this function by $h(\cdot)$:

\[
h(x) := E(Y\mid X=x).
\]
If we now apply this function to the random variable X, we obtain a new random variable
h(X). This random variable is denoted by simply E(Y |X) i.e.,

E(Y |X) := h(X).

Note that when $X$ is discrete, the expectation of this random variable $E(Y\mid X)$ becomes

\[
E(E(Y\mid X)) = E(h(X)) = \sum_x h(x)\, P\{X=x\} = \sum_x E(Y\mid X=x)\, P\{X=x\}.
\]

And when $X$ is continuous, the expectation of $E(Y\mid X)$ is

\[
E(E(Y\mid X)) = E(h(X)) = \int h(x)\, f_X(x)\, dx = \int E(Y\mid X=x)\, f_X(x)\, dx.
\]

Observe that the right hand sides in these expectations are precisely the terms on the right
hand side of the law of total expectation. Therefore the law of total expectation can be
rephrased as
E(Y ) = E(E(Y |X)).

Because there are two expectations on the right hand side, the law of total expectation is
also known as the Law of Iterated Expectation.

The law of iterated expectation has many applications. A couple of simple examples are given below, following which we shall explore applications to risk minimization.

Example 16.1. Consider a stick of length ℓ. Break it at a random point X chosen uniformly across the length of the stick. Then break the piece of length X again at a random point Y chosen uniformly along its length. What is the expected length of the final piece?

According to the description of the problem,

\[
Y \mid X=x \sim \mathrm{Unif}(0,x) \quad\text{and}\quad X \sim \mathrm{Unif}(0,\ell),
\]

and we are required to calculate E(Y ). Note first that E(Y |X = x) = x/2 for every x which
means that E(Y |X) = X/2. Hence by the Law of Iterated Expectation,

E(Y ) = E(E(Y |X)) = E(X/2) = ℓ/4.
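A quick Monte Carlo check of this answer in Python (the sample size is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
ell = 1.0
X = rng.uniform(0, ell, size=10 ** 6)  # first break point, uniform on (0, ell)
Y = rng.uniform(0, X)                  # second break point, uniform on (0, X)
print(Y.mean())                        # approximately ell / 4 = 0.25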

Example 16.2. Suppose $X, Y, Z$ are i.i.d. $\mathrm{Unif}(0,1)$ random variables. Find $P\{X \le YZ\}$.

By the Law of Iterated Expectation,

P{X ≤ Y Z} = E (I{X ≤ Y Z}) = E [E (I{X ≤ Y Z}|Y Z)] = E(Y Z) = E(Y )E(Z) = 1/4.

Example 16.3 (Sum of a random number of i.i.d random variables). Suppose X1 , X2 , . . .


are i.i.d. random variables with $E(X_i)=\mu$. Suppose also that $N$ is a discrete random variable that takes values in $\{1,2,\dots\}$ and is independent of $X_1, X_2, \dots$. Define

S := X1 + X2 + · · · + XN .

In other words, S is the sum of a random number (N ) of the random variables Xi . The law
of iterated expectation can be used to compute the expectation of S as follows:

E(S) = E(E(S|N )) = E(N µ) = (µ)(EN ) = (EN )(EX1 ).

This fact is actually a special case of a general result called Wald’s identity.

16.1.2 Application of the Law of Total Expectation to Statistical Risk Minimization

The law of the iterated expectation has important applications to statistical risk minimization
problems. The simplest of these problems is the following.

Problem 1: Given two random variables X and Y , what is the function g ∗ (X) of X that
minimizes
R(g) := E (g(X) − Y )2
over all functions g? The resulting random variable g ∗ (X) can be called the Best Predictor
of Y as a function of X in terms of expected squared error.

To find g ∗ , we use the law of iterated expectation to write


\[
R(g) = E(g(X)-Y)^2 = E\left\{ E\left[ (g(X)-Y)^2 \mid X \right] \right\}.
\]

The value $g^*(x)$ which minimizes the inner expectation

\[
E\left[ (Y-g(x))^2 \mid X=x \right]
\]

is simply

\[
g^*(x) = E(Y\mid X=x).
\]
This is because E(Z − c)2 is minimized as c varies over R at c∗ = E(Z). We have thus proved
that the function g ∗ (X) which minimizes R(g) over all functions g is given by

g ∗ (X) = E(Y |X).

Thus the function of X which is closest to Y in terms of expected squared error is given by
the conditional mean E(Y |X).

Let us now consider a different risk minimization problem.

Problem 2: Given two random variables X and Y , what is the function g ∗ (X) of X that
minimizes
R(g) := E |g(X) − Y |
over all functions g? The resulting random variable g ∗ (X) can be called the Best Predictor
of Y as a function of X in terms of expected absolute error.

To find g ∗ we use the law of iterated expectation to write

R(g) = E |g(X) − Y | = E {E [|g(X) − Y | |X]}

The value g ∗ (x) which minimizes the inner expectation:

E [|Y − g(x)| |X = x]

is simply given by any conditional median of Y given X = x. This is because E|Z − c| is


minimized as c varies over R at any median of Z. To see this, assume that Z has a density
f and write
\[
\begin{aligned}
E|Z-c| &= \int |z-c|\, f(z)\, dz \\
&= \int_{-\infty}^{c} (c-z) f(z)\, dz + \int_{c}^{\infty} (z-c) f(z)\, dz \\
&= c\int_{-\infty}^{c} f(z)\, dz - \int_{-\infty}^{c} z f(z)\, dz + \int_{c}^{\infty} z f(z)\, dz - c\int_{c}^{\infty} f(z)\, dz.
\end{aligned}
\]

Differentiating with respect to $c$, we get

\[
\frac{d}{dc}\, E|Z-c| = \int_{-\infty}^{c} f(z)\, dz - \int_{c}^{\infty} f(z)\, dz.
\]

Therefore when $c$ is a median, the derivative of $E|Z-c|$ equals zero. This shows that $c \mapsto E|Z-c|$ is minimized when $c$ is a median of $Z$.

We have thus shown that the function $g^*(x)$ which minimizes $R(g)$ over all functions $g$ is given by any conditional median of $Y$ given $X=x$. Thus the conditional median of $Y$ given $X=x$ is the function of $X$ that is closest to $Y$ in terms of expected absolute error.

Problem 3: Suppose Y is a binary random variable taking the values 0 and 1 and let
X be an arbitrary random variable. What is the function g ∗ (X) of X that minimizes

R(g) := P{Y ̸= g(X)}

over all functions g? To solve this, again use the law of iterated expectation to write

R(g) = P{Y ̸= g(X)} = E (P {Y ̸= g(X)|X}) .

In the inner expectation above, we can treat X as a constant so that the problem is similar
to minimizing P{Z ̸= c} over c ∈ R for a binary random variable Z. It is easy to see that
$P\{Z \neq c\}$ is minimized at $c^*$ where

\[
c^* =
\begin{cases}
1 & \text{if } P\{Z=1\} > P\{Z=0\} \\
0 & \text{if } P\{Z=1\} < P\{Z=0\}.
\end{cases}
\]

In case $P\{Z=1\} = P\{Z=0\}$, we can take $c^*$ to be either 0 or 1. From here it can be deduced (via the law of iterated expectation) that the function $g^*(X)$ which minimizes $P\{Y\neq g(X)\}$ is given by

\[
g^*(x) =
\begin{cases}
1 & \text{if } P\{Y=1\mid X=x\} > P\{Y=0\mid X=x\} \\
0 & \text{if } P\{Y=1\mid X=x\} < P\{Y=0\mid X=x\}.
\end{cases}
\]

Problem 4: Suppose again that $Y$ is binary taking the values 0 and 1 and let $X$ be an arbitrary random variable. What is the function $g^*(X)$ of $X$ that minimizes

\[
R(g) := W_0\, P\{Y\neq g(X),\, Y=0\} + W_1\, P\{Y\neq g(X),\, Y=1\}?
\]

Using an argument similar to the previous problems, one can deduce that the following function minimizes $R(g)$:

\[
g^*(x) =
\begin{cases}
1 & \text{if } W_1 P\{Y=1\mid X=x\} > W_0 P\{Y=0\mid X=x\} \\
0 & \text{if } W_1 P\{Y=1\mid X=x\} < W_0 P\{Y=0\mid X=x\}.
\end{cases}
\]

The argument (via the law of iterated expectation) used in the above four problems can
be summarized as follows. The function g ∗ which minimizes

R(g) := EL(Y, g(X))

over all functions g is given by

g ∗ (x) = minimizer of E(L(Y, c)|X = x) over c ∈ R.

16.2 Conditional Variance

Given two random variables Y and X, the conditional variance of Y given X = x is defined
as the variance of the conditional distribution of Y given X = x. More formally,
\[
\mathrm{Var}(Y\mid X=x) := E\left[(Y-E(Y\mid X=x))^2 \mid X=x\right] = E\left(Y^2\mid X=x\right) - \left(E(Y\mid X=x)\right)^2.
\]


Like conditional expectation, the conditional variance V ar(Y |X = x) is also a function of


x. We can apply this function to the random variable X to obtain a new random variable
which we denote by V ar(Y |X). Note that

V ar(Y |X) = E(Y 2 |X) − (E(Y |X))2 . (81)

Analogous to the Law of Total Expectation, there is a Law of Total Variance as well. This
formula says that

V ar(Y ) = E(V ar(Y |X)) + V ar(E(Y |X)).

To prove this formula, expand the right hand side as
\[
\begin{aligned}
E(\mathrm{Var}(Y\mid X)) + \mathrm{Var}(E(Y\mid X))
&= E\left\{E(Y^2\mid X) - (E(Y\mid X))^2\right\} + E\left((E(Y\mid X))^2\right) - \left(E(E(Y\mid X))\right)^2 \\
&= E(E(Y^2\mid X)) - E\left((E(Y\mid X))^2\right) + E\left((E(Y\mid X))^2\right) - (E(Y))^2 \\
&= E(Y^2) - (EY)^2 = \mathrm{Var}(Y).
\end{aligned}
\]

Example 16.4. We have seen before that

X|Θ = θ ∼ N (θ, σ 2 ) and Θ ∼ N (µ, τ 2 ) =⇒ X ∼ N (µ, σ 2 + τ 2 ).

This, of course, means that

E(X) = µ and V ar(X) = σ 2 + τ 2 .

Using the laws of total expectation and total variance, it is possible to prove these directly as
follows.
E(X) = E(E(X|Θ)) = E(Θ) = µ
and
V ar(X) = E(V ar(X|Θ)) + V ar(E(X|Θ)) = E(σ 2 ) + V ar(Θ) = σ 2 + τ 2 .

Example 16.5 (Sum of a random number of i.i.d random variables). Suppose X1 , X2 , . . .


are i.i.d random variables with E(Xi ) = µ and V ar(Xi ) = σ 2 < ∞. Suppose also that
$N$ is a discrete random variable that takes values in $\{1,2,\dots\}$ and is independent of
X1 , X2 , . . . . Define
S := X1 + X2 + · · · + XN .
We have seen previously that

E(S) = E(E(S|N )) = E(N µ) = (µ)(EN ) = (EN )(EX1 ).

Using the law of total variance, we can calculate $\mathrm{Var}(S)$ as follows:

\[
\mathrm{Var}(S) = E(\mathrm{Var}(S\mid N)) + \mathrm{Var}(E(S\mid N)) = E(N\sigma^2) + \mathrm{Var}(N\mu) = \sigma^2(EN) + \mu^2\,\mathrm{Var}(N).
\]
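Both formulas are easy to verify by simulation. A Python sketch, under the illustrative assumptions $X_i \sim \mathrm{Exp}(1)$ (so $\mu=\sigma^2=1$) and $N = 1+\mathrm{Poisson}(4)$ (so that $N\ge 1$, with $EN=5$ and $\mathrm{Var}(N)=4$):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, lam = 1.0, 1.0, 4.0
reps = 10 ** 5

N = 1 + rng.poisson(lam, size=reps)
# The sum of N i.i.d. Exp(1) variables has the Gamma(N, 1) distribution.
S = rng.gamma(shape=N, scale=1.0)

EN, VarN = 1 + lam, lam
print(S.mean(), EN * mu)                      # both approximately 5
print(S.var(), sigma2 * EN + mu ** 2 * VarN)  # both approximately 9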

17 Lecture Seventeen

We shall study the multivariate normal and multivariate t-distributions today. Before going
over them, let us first recall these densities in the univariate case.

17.1 Univariate normal and t densities

Let us start by looking at the standard normal distribution. We say that Z is standard
normal if its density is given by
 2
1 z
fZ (z) := √ exp − .
2π 2
The mean of Z is zero and its variance equals 1.

From standard normal, we define general normal distributions via a scale and location
change. Specifically, for two real numbers µ and a, define

X = µ + aZ.

The density of X is given by (using the change of variable formula):

\[
f_X(x) = \frac{1}{\sqrt{2\pi}\,|a|}\exp\left(-\frac{(x-\mu)^2}{2a^2}\right).
\]

This density depends on the two quantities µ and a2 and is denoted by N (µ, a2 ). Note that
the quantity a can be positive or negative but the density only depends on |a| or a2 . This
density is called the Normal density with parameters µ and a2 . It is easy to check (using
X = µ + aZ) that the mean of X equals µ and variance of X equals a2 .

Next we come to the t-distribution. This is obtained by a further scale change involving
an independent chi-squared distributed random variable. Specifically consider a random
variable V that has the χ2k distribution (χ2k is the chi-squared distribution with k degrees
of freedom; recall that χ2k = Gamma(k/2, 1/2)) and assume that V and Z are independent.
Define

\[
T = \mu + \frac{1}{\sqrt{V/k}}\, aZ.
\]
V /k
Thus T is very similar to X except for the additional scale change involving the random
variable $V/k$. This random variable has mean and variance given by

\[
E\left(\frac{V}{k}\right) = \frac{EV}{k} = \frac{E(\chi^2_k)}{k} = \frac{k}{k} = 1
\quad\text{and}\quad
\mathrm{Var}\left(\frac{V}{k}\right) = \frac{\mathrm{Var}(V)}{k^2} = \frac{\mathrm{Var}(\chi^2_k)}{k^2} = \frac{2k}{k^2} = \frac{2}{k}.
\]
Thus when k is large, V /k is a random variable with mean 1 and very small variance which
implies that V /k will be highly concentrated around 1. Thus the additional scale change
involving V /k will play very little role if k is large. But if k is not very large, then it will
make the distribution of T considerably different from that of X. The density of T can be
explicitly calculated using the following argument.
\[
f_T(y) = \int_0^{\infty} f_{T\mid V=x}(y)\, f_V(x)\, dx.
\]

Observe now that

\[
T \mid V=x \;=\; \mu + \frac{aZ}{\sqrt{x/k}} \;\sim\; N\!\left(\mu,\ \frac{a^2 k}{x}\right)
\]

so that

\[
f_{T\mid V=x}(y) = \frac{\sqrt{x}}{\sqrt{2\pi}\,|a|\sqrt{k}}\exp\left(-\frac{x}{2a^2 k}(y-\mu)^2\right).
\]
As a result,

\[
\begin{aligned}
f_T(y) &= \int_0^{\infty} f_{T\mid V=x}(y)\, f_V(x)\, dx \\
&\propto \int_0^{\infty} \frac{\sqrt{x}}{\sqrt{2\pi}\,|a|\sqrt{k}}\exp\left(-\frac{x}{2a^2 k}(y-\mu)^2\right) x^{\frac{k}{2}-1} e^{-x/2}\, dx \\
&\propto \int_0^{\infty} x^{\frac{k+1}{2}-1}\exp\left(-\frac{x}{2}\left(1+\frac{(y-\mu)^2}{ka^2}\right)\right) dx.
\end{aligned}
\]

The change of variable

\[
t = x\left(1+\frac{(y-\mu)^2}{ka^2}\right)
\]

now leads to

\[
f_T(y) \propto \frac{1}{\left(1+\frac{(y-\mu)^2}{ka^2}\right)^{\frac{k+1}{2}}}\int_0^{\infty} t^{\frac{k+1}{2}-1} e^{-t/2}\, dt \propto \frac{1}{\left(1+\frac{(y-\mu)^2}{ka^2}\right)^{\frac{k+1}{2}}}.
\]

The density of this random variable $T$ will be denoted by $t_k(\mu,a^2)$. In other words, the density of $t_k(\mu,a^2)$ is proportional to

\[
y \mapsto \frac{1}{\left(1+\frac{(y-\mu)^2}{ka^2}\right)^{\frac{k+1}{2}}}.
\]

This density has heavier tails compared to the normal density $N(\mu,a^2)$. The mean of $t_k(\mu,a^2)$ exists if and only if $k>1$ and equals $\mu$. Its variance exists if and only if $k>2$ and equals $\frac{k}{k-2}a^2$.

17.2 Random Vectors and Covariance Matrices

In order to discuss the multivariate normal distribution, we shall use the language of random vectors and covariance matrices, which are defined next.

A finite number of random variables can be viewed together as a random vector. More pre-
cisely, a random vector is a vector whose entries are random variables. Let Y = (Y1 , . . . , Yn )T
be an n × 1 random vector. Its Expectation EY is defined as a vector whose ith entry is the
expectation of Yi i.e., EY = (EY1 , EY2 , . . . , EYn )T . The covariance matrix of Y , denoted by
Cov(Y ), is an n × n matrix whose (i, j)th entry is the covariance between Yi and Yj . Two
important but easy facts about Cov(Y ) are:

1. The diagonal entries of Cov(Y ) are the variances of Y1 , . . . , Yn . More specifically the
(i, i)th entry of the matrix Cov(Y ) equals var(Yi ).

2. Cov(Y ) is a symmetric matrix i.e., the (i, j)th entry of Cov(Y ) equals the (j, i) entry.
This follows because Cov(Yi , Yj ) = Cov(Yj , Yi ).

One can also check:

1. E(AY +c) = AE(Y )+c for every deterministic matrix A and every deterministic vector
c.

2. Cov(AY + c) = ACov(Y )AT for every deterministic matrix A and every deterministic
vector c.

As a consequence of the second formula above, we get


\[
\mathrm{Var}(a^T Y) = a^T\,\mathrm{Cov}(Y)\, a = \sum_{i,j} a_i a_j\,\mathrm{Cov}(Y_i,Y_j) \quad\text{for every } n\times 1 \text{ vector } a.
\]

17.3 Multivariate Normal and t-densities

We shall follow the same program as in the univariate case. We first define standard multi-
variate normal, then general multivariate normal followed by the multivariate t.

We say that a $p\times 1$ random vector $Z$ has the standard $p$-variate normal distribution if its components $Z_1,\dots,Z_p$ are independently distributed according to the standard normal distribution, i.e., $Z_1,\dots,Z_p \overset{\text{i.i.d.}}{\sim} N(0,1)$. The joint density of $Z_1,\dots,Z_p$ is then

\[
\begin{aligned}
f_{Z_1,\dots,Z_p}(z_1,\dots,z_p)
&= \prod_{i=1}^p \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z_i^2}{2}\right)
= \left(\frac{1}{\sqrt{2\pi}}\right)^p \exp\left(-\frac{\sum_{i=1}^p z_i^2}{2}\right) \\
&= \left(\frac{1}{\sqrt{2\pi}}\right)^p \exp\left(-\frac{\|z\|^2}{2}\right)
= \left(\frac{1}{\sqrt{2\pi}}\right)^p \exp\left(-\frac{z^T z}{2}\right).
\end{aligned}
\]

The mean vector of Z is simply the zero vector and the covariance matrix of Z is the p × p
identity matrix Ip .

From the standard p-variate normal, we obtain a general p-variate normal distribution in
the following way. Suppose µ is a fixed p-dimensional vector and suppose A is a fixed p × p
invertible matrix. Define
X = µ + AZ
Here AZ is the matrix-vector multiplication of the p × p matrix A with the p × 1 vector Z.
By the Jacobian formula, the joint density of the components X1 , . . . , Xp of X is given by

\[
\begin{aligned}
f_{X_1,\dots,X_p}(x_1,\dots,x_p)
&= f_{Z_1,\dots,Z_p}\!\left(A^{-1}(x-\mu)\right)\left|\det(A^{-1})\right| \\
&= \left(\frac{1}{\sqrt{2\pi}}\right)^p \exp\left(-\frac{\left(A^{-1}(x-\mu)\right)^T A^{-1}(x-\mu)}{2}\right)\left|\det(A^{-1})\right| \\
&= \left(\frac{1}{\sqrt{2\pi}}\right)^p \exp\left(-\frac{(x-\mu)^T (A^T)^{-1} A^{-1}(x-\mu)}{2}\right)\left|\det(A^{-1})\right| \\
&= \left(\frac{1}{\sqrt{2\pi}}\right)^p \exp\left(-\frac{(x-\mu)^T (AA^T)^{-1}(x-\mu)}{2}\right)\left|\det(A^{-1})\right|.
\end{aligned}
\]

We now let
Σ := AAT .
Because the determinant of a product of matrices equals the product of the determinants

\[
\det(\Sigma) = \det(AA^T) = \det(A)\det(A^T) = (\det(A))^2,
\]

which implies, in particular, that $\det(\Sigma)>0$. Using this (and the fact that the determinant of the inverse of a matrix equals the inverse of the determinant), we can write

\[
\left|\det(A^{-1})\right| = \frac{1}{|\det A|} = \frac{1}{\sqrt{\det\Sigma}}.
\]

We can thus write

\[
f_{X_1,\dots,X_p}(x_1,\dots,x_p) = \left(\frac{1}{\sqrt{2\pi}}\right)^p \frac{1}{\sqrt{\det\Sigma}}\exp\left(-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{2}\right).
\]

This density depends on the vector µ as well as on the matrix Σ = AAT . It is therefore
denoted by Np (µ, Σ). It is easy to see (using X = µ + AZ) that µ is the mean vector of X

and Σ is the covariance matrix of X:

E(X) = E(µ + AZ) = µ + AE(Z) = µ


Cov(X) = Cov(µ + AZ) = ACov(Z)AT = A(Ip )AT = AAT = Σ.

Let us now define the multivariate t-density. This is obtained by changing the scale of the
multivariate normal density via a chi-squared random variable. Specifically, let V denote a
χ2k random variable that is independent of a p-variate standard normal vector Z. Let
\[
T := \mu + \frac{1}{\sqrt{V/k}}\, AZ
\]

for a p × 1 vector µ and an invertible p × p matrix A. Note that T is given by


\[
\begin{pmatrix} T_1 \\ \vdots \\ T_p \end{pmatrix}
=
\begin{pmatrix}
\mu_1 + \dfrac{\sum_{j=1}^p A(1,j)Z_j}{\sqrt{V/k}} \\
\vdots \\
\mu_p + \dfrac{\sum_{j=1}^p A(p,j)Z_j}{\sqrt{V/k}}
\end{pmatrix}. \tag{82}
\]

Note that the scale change on each component of T is through the same scalar random
variable V .

The distribution of this random vector T will be denoted by tk,p (µ, Σ). Its density can be
derived just as in the univariate case in the following way:
\[
f_{T_1,\dots,T_p}(y_1,\dots,y_p) = \int_0^{\infty} f_{T_1,\dots,T_p\mid V=x}(y_1,\dots,y_p)\, f_V(x)\, dx.
\]

Observe that, when $V$ is fixed at $x$, the random vector $T$ becomes

\[
T = \mu + \frac{A}{\sqrt{x/k}}\, Z \sim N_p\!\left(\mu,\ \frac{A}{\sqrt{x/k}}\left(\frac{A}{\sqrt{x/k}}\right)^T\right) = N_p\!\left(\mu,\ \frac{k}{x}AA^T\right) = N_p\!\left(\mu,\ \frac{k}{x}\Sigma\right)
\]

so that

\[
\begin{aligned}
f_{T_1,\dots,T_p\mid V=x}(y)
&= \frac{1}{(2\pi)^{p/2}\sqrt{\det\left(\frac{k}{x}\Sigma\right)}}\exp\left(-\frac{1}{2}(y-\mu)^T\left(\frac{k}{x}\Sigma\right)^{-1}(y-\mu)\right) \\
&= \frac{x^{p/2}}{(2\pi)^{p/2}\, k^{p/2}\sqrt{\det(\Sigma)}}\exp\left(-\frac{x}{2k}(y-\mu)^T\Sigma^{-1}(y-\mu)\right),
\end{aligned}
\]

where we used $\det\left(\frac{k}{x}\Sigma\right) = (k/x)^p\det(\Sigma)$. As a result

\[
\begin{aligned}
f_{T_1,\dots,T_p}(y_1,\dots,y_p)
&= \int_0^{\infty} f_{T_1,\dots,T_p\mid V=x}(y_1,\dots,y_p)\, f_V(x)\, dx \\
&\propto \int_0^{\infty} \frac{x^{p/2}}{(2\pi)^{p/2} k^{p/2}\sqrt{\det(\Sigma)}}\exp\left(-\frac{x}{2k}(y-\mu)^T\Sigma^{-1}(y-\mu)\right) x^{\frac{k}{2}-1} e^{-x/2}\, dx \\
&\propto \int_0^{\infty} x^{\frac{p+k}{2}-1}\exp\left(-\frac{x}{2}\left(1+\frac{1}{k}(y-\mu)^T\Sigma^{-1}(y-\mu)\right)\right) dx.
\end{aligned}
\]

The change of variable

\[
t = x\left(1+\frac{1}{k}(y-\mu)^T\Sigma^{-1}(y-\mu)\right)
\]

leads to

\[
f_T(y) \propto \frac{1}{\left(1+\frac{1}{k}(y-\mu)^T\Sigma^{-1}(y-\mu)\right)^{\frac{k+p}{2}}}\int_0^{\infty} t^{\frac{k+p}{2}-1} e^{-t/2}\, dt
\propto \frac{1}{\left(1+\frac{1}{k}(y-\mu)^T\Sigma^{-1}(y-\mu)\right)^{\frac{k+p}{2}}}.
\]

Therefore the density corresponding to the $t_{k,p}(\mu,\Sigma)$ distribution is proportional to

\[
y \mapsto \frac{1}{\left(1+\frac{1}{k}(y-\mu)^T\Sigma^{-1}(y-\mu)\right)^{\frac{k+p}{2}}}. \tag{83}
\]

Note that, in the notation tk,p (µ, Σ), k denotes degrees of freedom, p denotes dimension, µ
and Σ = AAT denote the mean vector and covariance matrix of the corresponding normal
random vector µ + AZ.

As in the univariate case, when k (degrees of freedom) is large, tk,p (µ, Σ) is very close to
Np (µ, Σ).

As an application involving the multivariate normal and t-densities, we shall look at


Bayesian Linear Regression.

17.4 Bayesian Linear Regression

Here one observes data (x1 , y1 ), . . . , (xn , yn ). xi denotes the explanatory variable value and yi
denotes the response variable value for the ith individual. In usual linear regression analysis,
we assume the model
Yi = β0 + β1 xi + ϵi
for i = 1, . . . , n where
\[
\epsilon_1,\dots,\epsilon_n \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2).
\]
There are three parameters in this model: $\beta_0$, $\beta_1$ and $\sigma^2$. How do we fit this model to the observed data $(x_1,y_1),\dots,(x_n,y_n)$, i.e., how do we estimate the parameters $\beta_0,\beta_1,\sigma$ and characterize the uncertainty in the estimates?

As an alternative to the usual frequentist analysis, we shall apply probability theory to


solve this problem. The first step is to select a prior for the unknown parameters β0 , β1 , σ.
A reasonable prior reflecting ignorance is
\[
\beta_0,\ \beta_1,\ \log\sigma \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-C,C)
\]

for a large number $C$ (the exact value of $C$ will not matter in the following calculations). Note that as $\sigma$ is always positive, we have made the uniform assumption on $\log\sigma$ (by the change of variable formula, the density of $\sigma$ is then $f_\sigma(x) = f_{\log\sigma}(\log x)\frac{1}{x} = \frac{I\{-C<\log x<C\}}{2Cx} = \frac{I\{e^{-C}<x<e^{C}\}}{2Cx}$).

The joint posterior for all the unknown parameters β0 , β1 , σ is then given by (below we
write the term “data” for Y1 = y1 , . . . , Yn = yn ):

fβ0 ,β1 ,σ|data (β0 , β1 , σ) ∝ fY1 ,...,Yn |β0 ,β1 ,σ (y1 , . . . , yn )fβ0 ,β1 ,σ (β0 , β1 , σ).

The two terms on the right hand side above are

\[
\begin{aligned}
f_{Y_1,\dots,Y_n\mid\beta_0,\beta_1,\sigma}(y_1,\dots,y_n)
&\propto \prod_{i=1}^n f_{Y_i\mid\beta_0,\beta_1,\sigma}(y_i)
= \prod_{i=1}^n f_{\epsilon_i\mid\beta_0,\beta_1,\sigma}(y_i-\beta_0-\beta_1 x_i) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i-\beta_0-\beta_1 x_i)^2}{2\sigma^2}\right)
\propto \sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\right),
\end{aligned}
\]

and

\[
\begin{aligned}
f_{\beta_0,\beta_1,\sigma}(\beta_0,\beta_1,\sigma)
&= f_{\beta_0}(\beta_0)\, f_{\beta_1}(\beta_1)\, f_\sigma(\sigma)
= \frac{I\{-C<\beta_0<C\}}{2C}\cdot\frac{I\{-C<\beta_1<C\}}{2C}\cdot\frac{I\{e^{-C}<\sigma<e^{C}\}}{2C\sigma} \\
&\propto \frac{1}{\sigma}\, I\{-C<\beta_0,\beta_1,\log\sigma<C\}.
\end{aligned}
\]
We thus obtain

fβ0 ,β1 ,σ|data (β0 , β1 , σ)


n
!
−n−1 1 X
∝σ exp − 2 (yi − β0 − β1 xi )2 I {−C < β0 , β1 , log σ < C} .

i=1

The above is the joint posterior over $\beta_0,\beta_1,\sigma$. The posterior over only the main parameters $\beta_0,\beta_1$ can be obtained by integrating out the parameter $\sigma$ as follows:

\[
f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) = \int f_{\beta_0,\beta_1,\sigma\mid\text{data}}(\beta_0,\beta_1,\sigma)\, d\sigma
\propto I\{-C<\beta_0,\beta_1<C\}\int_{e^{-C}}^{e^{C}} \sigma^{-n-1}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\right) d\sigma.
\]

When $C$ is large, the above integral can be taken from 0 to $\infty$, which gives

\[
f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) \propto I\{-C<\beta_0,\beta_1<C\}\int_0^{\infty} \sigma^{-n-1}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\right) d\sigma.
\]

The change of variable

\[
s = \frac{\sigma}{\sqrt{\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2}}
\]

allows us to write the integral as

\[
\int_0^{\infty} \sigma^{-n-1}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\right) d\sigma
= \left(\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\right)^{-n/2}\int_0^{\infty} s^{-n-1}\exp\left(-\frac{1}{2s^2}\right) ds
\propto \left(\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\right)^{-n/2}.
\]

The posterior density of $(\beta_0,\beta_1)$ is thus

\[
f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) \propto I\{-C<\beta_0,\beta_1<C\}\left(\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\right)^{-n/2}. \tag{84}
\]

A key role in the above posterior is played by the least squares criterion:

\[
S(\beta_0,\beta_1) = \sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2.
\]

The usual point estimates of β0 and β1 are simply the minimizers β̂0 and β̂1 of the least
squares criterion S(β0 , β1 ).

We can rewrite the posterior (84) as

\[
f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) \propto I\{-C<\beta_0,\beta_1<C\}\left(\frac{1}{S(\beta_0,\beta_1)}\right)^{n/2}
\propto I\{-C<\beta_0,\beta_1<C\}\left(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\right)^{n/2}. \tag{85}
\]

Note that we have been able to bring in the term (S(β̂0 , β̂1 ))n/2 because it does not depend
on β0 , β1 and is thus a constant.

Generally, the density (85) will be quite sharply concentrated around the least squares estimator $(\hat\beta_0,\hat\beta_1)$, especially when $n$ is large. This is because, when $(\beta_0,\beta_1)$ is such that $S(\beta_0,\beta_1)$ is large compared to $S(\hat\beta_0,\hat\beta_1)$, the quantity
\[
\left(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\right)^{n/2}
\]

would be quite negligible because of the large power $n/2$. As a result, the posterior density $f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1)$ will be concentrated around those values of $(\beta_0,\beta_1)$ for which $S(\beta_0,\beta_1)$ is quite close to $S(\hat\beta_0,\hat\beta_1)$. For a concrete example, suppose $n=762$ and $(\beta_0,\beta_1)$ is such that $S(\beta_0,\beta_1) = (1.1)\,S(\hat\beta_0,\hat\beta_1)$. Then

\[
\left(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\right)^{n/2} = \left(\frac{1}{1.1}\right)^{381} \approx 1.7\times 10^{-16}.
\]
Such $(\beta_0,\beta_1)$ will thus get negligible posterior probability. Even for $(\beta_0,\beta_1)$ such that $S(\beta_0,\beta_1) = (1.01)\,S(\hat\beta_0,\hat\beta_1)$, we have

\[
\left(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\right)^{n/2} = \left(\frac{1}{1.01}\right)^{381} \approx 0.02,
\]

and so such $(\beta_0,\beta_1)$ will also get fairly small posterior probability.

To sum up, when $n$ is large, the posterior probability will be concentrated around those $(\beta_0,\beta_1)$ for which $S(\beta_0,\beta_1)$ is very close to $S(\hat\beta_0,\hat\beta_1)$. Generally, this implies that $(\beta_0,\beta_1)$ itself has to be close to $(\hat\beta_0,\hat\beta_1)$. For this reason, the indicator term in (85) has no effect when $C$ is large. We can thus drop this indicator term and write the posterior as simply

\[
f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) \propto \left(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\right)^{n/2}. \tag{86}
\]
We shall show in the next class that the above posterior density is simply the multivariate
t-density.

18 Lecture Eighteen

18.1 Recap: Multivariate Normal and t Distributions

In the last class, we looked at the multivariate normal distribution $N_p(\mu,\Sigma)$ ($p$ denotes dimension, $\mu$ denotes mean vector and $\Sigma$ denotes covariance matrix) with density

\[
\left(\frac{1}{\sqrt{2\pi}}\right)^p \frac{1}{\sqrt{\det\Sigma}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right),
\]

and the multivariate t-distribution $t_{k,p}(\mu,\Sigma)$ ($k$ is the degrees of freedom) with density proportional to

\[
\left(\frac{1}{1+\frac{1}{k}(x-\mu)^T\Sigma^{-1}(x-\mu)}\right)^{\frac{k+p}{2}}. \tag{87}
\]
One can generate random vectors having these distributions in the following way. First
consider a p × 1 random vector Z whose components Z1 , . . . , Zp are i.i.d standard normal.
Then
X = µ + AZ
has the Np (µ, Σ) distribution with Σ = AAT . Also
\[
T = \mu + \frac{1}{\sqrt{V/k}}\, AZ \tag{88}
\]
has the tk,p (µ, Σ) distribution. Here V has the χ2k distribution and we assume that V and Z
are independent.
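This recipe translates directly into code. Below is a minimal Python sketch (the values of $\mu$, $\Sigma$ and $k$ are arbitrary illustrations); any matrix $A$ with $AA^T=\Sigma$ works, and the Cholesky factor is a convenient choice.

import numpy as np

rng = np.random.default_rng(0)

def rmvt(mu, Sigma, k, size):
    """Draw samples from t_{k,p}(mu, Sigma) via T = mu + A Z / sqrt(V/k)."""
    p = len(mu)
    A = np.linalg.cholesky(Sigma)       # A with A @ A.T = Sigma
    Z = rng.standard_normal((size, p))
    V = rng.chisquare(k, size=size)
    return mu + (Z @ A.T) / np.sqrt(V / k)[:, None]

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
T = rmvt(mu, Sigma, k=5, size=10 ** 5)
print(T.mean(axis=0))  # approximately mu (the mean exists since k > 1)
print(np.cov(T.T))     # approximately (k / (k - 2)) * Sigma = (5/3) * Sigma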

The following fact will be useful in the sequel.

Fact 18.1. If $T \sim t_{k,p}(\mu,\Sigma)$ has components $T_1,\dots,T_p$, then, for each $i=1,\dots,p$,

\[
T_i \sim t_k(\mu_i,\ \Sigma(i,i))
\]

where $\mu_i$ is the $i$th component of $\mu$ and $\Sigma(i,i)$ is the $(i,i)$th entry of $\Sigma$. In words, each $T_i$ has a univariate t-distribution.

Proof. This fact follows directly from (88) because

\[
T_i = \mu_i + \frac{1}{\sqrt{V/k}}\sum_{j=1}^p A(i,j)Z_j.
\]

Now $\sum_{j=1}^p A(i,j)Z_j$ has the normal distribution with mean zero and variance

\[
\sum_{j=1}^p (A(i,j))^2 = \sum_{j=1}^p A(i,j)A^T(j,i) = (AA^T)(i,i) = \Sigma(i,i).
\]

Therefore we can write

\[
\sum_{j=1}^p A(i,j)Z_j = \sqrt{\Sigma(i,i)}\, W \quad\text{where } W \sim N(0,1).
\]

Thus

\[
T_i = \mu_i + \frac{1}{\sqrt{V/k}}\sqrt{\Sigma(i,i)}\, W.
\]

This has the same form as (88) except that instead of $AZ$, we have the univariate product $\sqrt{\Sigma(i,i)}\, W$ where $W \sim N(0,1)$. Thus $T_i \sim t_k(\mu_i, \Sigma(i,i))$.

18.2 Application to Linear Regression

We considered the usual linear regression model in the last class. One observes data
(x1 , y1 ), . . . , (xn , yn ). xi denotes the explanatory variable value and yi denotes the response
variable value for the ith individual. We consider the model

Yi = β0 + β1 xi + ϵi

for i = 1, . . . , n where
\[
\epsilon_1,\dots,\epsilon_n \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2).
\]

There are three parameters in this model: $\beta_0$, $\beta_1$ and $\sigma^2$. The goal is to estimate the parameters $\beta_0,\beta_1$ and also characterize the uncertainty in the estimates.

We worked with the prior distribution


\[
\beta_0,\ \beta_1,\ \log\sigma \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-C,C)
\]

for a large number C and calculated the posterior density of β0 , β1 (by integrating the full
posterior of β0 , β1 , σ over σ) to be

\[
f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) \propto I\{-C<\beta_0,\beta_1<C\}\left(\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\right)^{-n/2}.
\]

We show below that this density is very closely related to the multivariate t-density (87). To see this, let us use the notation

\[
S(\beta_0,\beta_1) = \sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2.
\]

The usual point estimates of β0 and β1 are simply the minimizers β̂0 and β̂1 of the least
squares criterion S(β0 , β1 ).

We can then rewrite the above posterior as


\[
f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1) \propto I\{-C<\beta_0,\beta_1<C\}\left(\frac{1}{S(\beta_0,\beta_1)}\right)^{n/2}
\propto I\{-C<\beta_0,\beta_1<C\}\left(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\right)^{n/2}. \tag{89}
\]

Note that we have been able to bring in the term (S(β̂0 , β̂1 ))n/2 because it does not depend
on β0 , β1 and is thus a constant.

Using the notation

\[
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad
\beta := \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \qquad
\hat\beta := \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix},
\]

we can write

\[
S(\beta_0,\beta_1) = S(\beta) = \|Y-X\beta\|^2 \quad\text{and}\quad S(\hat\beta_0,\hat\beta_1) = S(\hat\beta) = \|Y-X\hat\beta\|^2.
\]

We now use the following Pythagorean decomposition:

\[
S(\beta) = \|Y-X\beta\|^2 = \|Y-X\hat\beta\|^2 + \|X\beta-X\hat\beta\|^2 = S(\hat\beta) + (\beta-\hat\beta)^T X^T X(\beta-\hat\beta).
\]

We can thus write

\[
\begin{aligned}
f_{\beta_0,\beta_1\mid\text{data}}(\beta)
&\propto I\{-C<\beta_0,\beta_1<C\}\left(\frac{S(\hat\beta)}{S(\beta)}\right)^{n/2} \\
&= I\{-C<\beta_0,\beta_1<C\}\left(\frac{S(\hat\beta)}{S(\hat\beta)+(\beta-\hat\beta)^T X^TX(\beta-\hat\beta)}\right)^{n/2} \\
&= \left(\frac{1}{1+(\beta-\hat\beta)^T\frac{X^TX}{S(\hat\beta)}(\beta-\hat\beta)}\right)^{n/2} I\{-C<\beta_0,\beta_1<C\}.
\end{aligned}
\]

If we ignore the indicator above, this density is simply the multivariate t-density with dimension $p=2$, degrees of freedom $k=n-2$, mean parameter $\hat\beta$ and covariance matrix parameter $\Sigma$ where

\[
\Sigma^{-1} = \frac{n-2}{S(\hat\beta)}(X^TX) \quad\text{so that}\quad \Sigma = \frac{S(\hat\beta)}{n-2}(X^TX)^{-1}.
\]

Therefore the posterior density is just the $t_{n-2,2}(\hat\beta,\Sigma)$ density truncated to the set $-C<\beta_0,\beta_1<C$. When $C$ is large, this truncation has little practical effect, so we can just treat the posterior density as $t_{n-2,2}(\hat\beta,\Sigma)$.

Let us now use some standard regression terminology. The residuals are $y_i-\hat\beta_0-\hat\beta_1 x_i$ for $i=1,\dots,n$;

\[
S(\hat\beta) = S(\hat\beta_0,\hat\beta_1) = \sum_{i=1}^n \left(y_i-\hat\beta_0-\hat\beta_1 x_i\right)^2 \quad\text{is the Residual Sum of Squares (RSS);}
\]

\[
\hat\sigma := \sqrt{\frac{S(\hat\beta)}{n-2}} \quad\text{is the Residual Standard Error.}
\]
We have thus proved that

\[
\beta \mid \text{data} \sim t_{n-2,2}\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right).
\]

With this posterior density, one can do uncertainty quantification for the parameters $\beta_0$ and $\beta_1$. One can generate multiple samples from $t_{n-2,2}(\hat\beta,\hat\sigma^2(X^TX)^{-1})$ and plot the resulting lines to visualize the uncertainty in $\beta_0$ and $\beta_1$. One can also use Fact 18.1 to deduce that

\[
\beta_0 \mid \text{data} \sim t_{n-2}\left(\hat\beta_0,\ \hat\sigma^2(X^TX)^{11}\right) \quad\text{and}\quad \beta_1 \mid \text{data} \sim t_{n-2}\left(\hat\beta_1,\ \hat\sigma^2(X^TX)^{22}\right)
\]

where $(X^TX)^{11}$ and $(X^TX)^{22}$ are the first and second diagonal entries of $(X^TX)^{-1}$ respectively. These univariate t-densities describe the marginal uncertainty in the intercept and slope parameters. When $n$ is large, they will be close to the normal distributions $N(\hat\beta_0,\hat\sigma^2(X^TX)^{11})$ and $N(\hat\beta_1,\hat\sigma^2(X^TX)^{22})$ respectively. The quantities $\hat\sigma\sqrt{(X^TX)^{11}}$ and $\hat\sigma\sqrt{(X^TX)^{22}}$ are known as the standard errors of the intercept and the slope respectively.
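The whole analysis takes only a few lines of Python. The sketch below uses simulated data (the data-generating values are assumptions for illustration); it computes $\hat\beta$, the residual standard error, the standard errors, and draws from the posterior t-distribution using the recipe from Section 18.1.

import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)   # assumed true line: 2 + 0.5 x

X = np.column_stack([np.ones(n), x])           # n x 2 design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)          # S(beta_hat)
Sigma = (rss / (n - 2)) * np.linalg.inv(X.T @ X)

print(beta_hat)                 # centre of the posterior t-density
print(np.sqrt(np.diag(Sigma)))  # standard errors of intercept and slope

# Draws from t_{n-2,2}(beta_hat, Sigma) to visualize uncertainty:
A = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((1000, 2))
V = rng.chisquare(n - 2, size=1000)
beta_draws = beta_hat + (Z @ A.T) / np.sqrt(V / (n - 2))[:, None]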

18.3 Multiple Linear Regression

The analysis for multiple linear regression is very similar to the analysis of the last section. Here
one observes data (yi , xi1 , xi2 , . . . , xim ) for i = 1, . . . , n. There are m explanatory variables
x1 , . . . , xm and one response variable. xij denotes the value of the j th explanatory variable
for the ith individual and yi is the value of the response variable for the ith individual. The
model is
Yi = β0 + β1 xi1 + β2 xi2 + · · · + βm xim + ϵi
for i = 1, . . . , n where
\[
\epsilon_1,\dots,\epsilon_n \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2).
\]
The goal is to estimate the parameters β0 , . . . , βm as well as σ from the data. σ is usu-
ally treated as a nuisance parameter and the main parameters of interest are β0 , . . . , βm .
The model studied in the previous section is often called simple linear regression and it
corresponds to m = 1 (i.e., there is only one explanatory variable).

We shall work with the prior distribution:


\[
\beta_0,\beta_1,\dots,\beta_m,\ \log\sigma \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-C,C).
\]

Under this assumption, it can be easily seen that the posterior for $\beta_0,\dots,\beta_m$ is given (just like in the last section) by

\[
f_{\beta_0,\beta_1,\dots,\beta_m\mid\text{data}} \propto I\{-C<\beta_0,\dots,\beta_m<C\}\left(\frac{S(\hat\beta_0,\dots,\hat\beta_m)}{S(\beta_0,\dots,\beta_m)}\right)^{n/2}
\]

where

\[
S(\beta_0,\dots,\beta_m) = \sum_{i=1}^n (y_i-\beta_0-\beta_1 x_{i1}-\dots-\beta_m x_{im})^2
\]

is the least squares criterion, and $\hat\beta_0,\dots,\hat\beta_m$ are the least squares estimators (the minimizers of $S(\beta_0,\dots,\beta_m)$). Using the matrix notation

\[
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}, \qquad
\beta := \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{pmatrix}, \qquad
\hat\beta := \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_m \end{pmatrix},
\]

it can be seen that

\[
\hat\beta = (X^TX)^{-1}X^TY \quad\text{and}\quad S(\beta) = \|Y-X\beta\|^2.
\]
We can then show (using the same Pythagorean decomposition as in the last section) that

\[
f_{\beta_0,\beta_1,\dots,\beta_m\mid\text{data}}(\beta_0,\dots,\beta_m) \propto \left(\frac{1}{1+(\beta-\hat\beta)^T\frac{X^TX}{S(\hat\beta)}(\beta-\hat\beta)}\right)^{n/2} I\{-C<\beta_0,\dots,\beta_m<C\}.
\]

If we ignore the indicator above, this density is simply the multivariate t-density with dimension $p=m+1$, degrees of freedom $k=n-p$, mean parameter $\hat\beta$ and covariance matrix parameter $\Sigma$ where

\[
\Sigma^{-1} = \frac{n-p}{S(\hat\beta)}(X^TX) \quad\text{so that}\quad \Sigma = \frac{S(\hat\beta)}{n-p}(X^TX)^{-1}.
\]

Therefore the posterior density is just the $t_{n-p,p}(\hat\beta,\Sigma)$ density truncated to the set $-C<\beta_0,\beta_1,\dots,\beta_m<C$. When $C$ is large, this truncation has little practical effect, so we can just treat the posterior density as $t_{n-p,p}(\hat\beta,\Sigma)$. We can then use Fact 18.1 to obtain marginal t-distributions for each individual component $\beta_j$, $j=0,1,\dots,m$.

Even when there is only one explanatory variable, multiple linear regression can be used to fit certain nonlinear functions of that variable. For example, one can fit quadratic functions via the model

\[
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \tag{90}
\]
with $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$, by the multiple linear regression methodology with

\[
X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}.
\]

The posterior density of $\beta = (\beta_0,\beta_1,\beta_2)^T$ will then be given by $t_{n-3,3}(\hat\beta,\hat\sigma^2(X^TX)^{-1})$ (note that the dimension now is 3 and the degrees of freedom is $n-3$).

A more general polynomial trend model (of degree k) can be fit analogously (the dimension
of β will then be k + 1 and the degrees of freedom of the posterior t-density will be n − k − 1).
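As an illustration, here is a small hypothetical Python helper for the quadratic model (90); only the construction of the design matrix changes relative to simple linear regression.

import numpy as np

def quadratic_posterior(x, y):
    """Posterior t-parameters for the model y = b0 + b1*x + b2*x^2 + noise."""
    n = len(x)
    X = np.column_stack([np.ones(n), x, x ** 2])  # columns 1, x, x^2
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    Sigma = (rss / (n - 3)) * np.linalg.inv(X.T @ X)
    return beta_hat, Sigma, n - 3  # centre, scale matrix, degrees of freedom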

Other examples include the following. To capture seasonal trend in time series data (here the explanatory variable is time with values $t_1,\dots,t_n$) with known period $s$ (for example $s=12$ for monthly data), one can use a model of the form

\[
Y_i = \beta_0 + \sum_{f=1}^r \left(\beta_{f1}\cos\frac{2\pi f t_i}{s} + \beta_{f2}\sin\frac{2\pi f t_i}{s}\right) + \epsilon_i.
\]

This is also a linear regression model with

\[
\beta = (\beta_0,\beta_{11},\beta_{12},\beta_{21},\beta_{22},\dots,\beta_{r1},\beta_{r2})^T
\]

and the $i$th row of the $n\times(2r+1)$ matrix $X$ given by

\[
\left(1,\ \cos\frac{2\pi t_i}{s},\ \sin\frac{2\pi t_i}{s},\ \cos\frac{2\pi(2)t_i}{s},\ \sin\frac{2\pi(2)t_i}{s},\ \dots,\ \cos\frac{2\pi(r)t_i}{s},\ \sin\frac{2\pi(r)t_i}{s}\right).
\]

Here the posterior t-density of $\beta$ will have dimension $2r+1$ and degrees of freedom $n-(2r+1)$.

Time series datasets often have both trend and seasonality. These effects can be estimated by models of the form

\[
Y_i = \sum_{j=0}^k \beta_j^{(1)} t_i^j + \sum_{f=1}^r \left(\beta_{f1}\cos\frac{2\pi f t_i}{s} + \beta_{f2}\sin\frac{2\pi f t_i}{s}\right) + \epsilon_i.
\]

Inference for this model can also be done through linear regression. The degrees of freedom
for the posterior t-density of the coefficients will now be n − (2r + k + 1). Our methodology
will work as long as n > 2r + k + 1.

18.4 Models with Nonlinear Parameter Dependence

The Bayesian methodology can be used even to fit models with nonlinear parameter depen-
dence such as:
\[
Y_i = \beta_0 + \beta_1\cos(2\pi f x_i) + \beta_2\sin(2\pi f x_i) + \epsilon_i \tag{91}
\]

with $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$. The setting here is the usual simple linear regression setting (there
is only one explanatory variable). The parameters are β0 , β1 , β2 , σ as well as the frequency
parameter f . One cannot use linear regression methodology directly here because the pa-
rameter f appears nonlinearly in the equation (91). We shall see how to handle this in the
next class.

19 Lecture Nineteen

19.1 Last Class: Linear Regression

Last class, we used probability to perform inference in the usual linear regression model. Here
one observes data (yi , xi1 , xi2 , . . . , xim ) for i = 1, . . . , n. There are m explanatory variables
x1 , . . . , xm and one response variable. xij denotes the value of the j th explanatory variable
for the ith individual and yi is the value of the response variable for the ith individual. The
model is
Yi = β0 + β1 xi1 + β2 xi2 + · · · + βm xim + ϵi
for i = 1, . . . , n where
\[
\epsilon_1,\dots,\epsilon_n \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2).
\]
The goal is to estimate the parameters β0 , . . . , βm as well as σ from the data. σ is usually
treated as a nuisance parameter and the main parameters of interest are β0 , . . . , βm .

We used the prior distribution:


\[
\beta_0,\beta_1,\dots,\beta_m,\ \log\sigma \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-C,C).
\]
Under this assumption, we derived the posterior distribution for β0 , . . . , βm to be
\[
f_{\beta_0,\beta_1,\dots,\beta_m\mid\text{data}} \propto I\{-C<\beta_0,\dots,\beta_m<C\}\left(\frac{S(\hat\beta_0,\dots,\hat\beta_m)}{S(\beta_0,\dots,\beta_m)}\right)^{n/2}
\]

where

\[
S(\beta_0,\dots,\beta_m) = \sum_{i=1}^n (y_i-\beta_0-\beta_1 x_{i1}-\dots-\beta_m x_{im})^2
\]

is the least squares criterion, and β̂0 , . . . , β̂m are the least squares estimators (these are the
minimizers of S(β0 , . . . , βm )).

We also saw that the posterior density can be written as

\[
f_{\beta_0,\beta_1,\dots,\beta_m\mid\text{data}}(\beta_0,\dots,\beta_m) \propto \left(\frac{1}{1+(\beta-\hat\beta)^T\frac{X^TX}{S(\hat\beta)}(\beta-\hat\beta)}\right)^{n/2} I\{-C<\beta_0,\dots,\beta_m<C\},
\]

using the notation

\[
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}, \qquad
\beta := \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{pmatrix}, \qquad
\hat\beta := \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_m \end{pmatrix}.
\]

If we ignore the indicator function, the above density is simply the multivariate t-density with dimension $p=m+1$, degrees of freedom $k=n-p$, mean parameter $\hat\beta$ and covariance matrix parameter $\Sigma$ where

\[
\Sigma^{-1} = \frac{n-p}{S(\hat\beta)}(X^TX) \quad\text{so that}\quad \Sigma = \frac{S(\hat\beta)}{n-p}(X^TX)^{-1}.
\]

Therefore the posterior density is just the $t_{n-p,p}(\hat\beta,\Sigma)$ density truncated to the set $-C<\beta_0,\beta_1,\dots,\beta_m<C$. When $C$ is large, this truncation has little practical effect, so we can just treat the posterior density as $t_{n-p,p}(\hat\beta,\Sigma)$. Point estimates for $\beta$ are just the least squares estimator $\hat\beta$, and uncertainty is usually summarized by the standard errors, which are simply the square roots of the diagonal entries of $\Sigma$. In other words, the standard error corresponding to $\hat\beta_j$ equals $\sqrt{S(\hat\beta)/(n-p)}$ multiplied by the square root of the corresponding diagonal entry of $(X^TX)^{-1}$.

19.2 Nonlinear Regression Models

In this framework, parameter inference in nonlinear regression models is handled in a very similar way. For a concrete example, consider the model

\[
Y_i = \beta_0 + \beta_1\exp(-\beta_2 x_i) + \epsilon_i
\]

for $i=1,\dots,n$ where, as in the previous section, $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$. This is a nonlinear regression model because the parameter $\beta_2$ enters via the exponential function, which is nonlinear. There are four unknown parameters: $\beta_0,\beta_1,\beta_2,\sigma$. We can obtain parameter estimates and standard errors for them in a manner that is very similar to the analysis in linear regression. We work with the prior

\[
\beta_0,\beta_1,\beta_2,\ \log\sigma \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-C,C)
\]
for a large $C>0$. The joint posterior for all the unknown parameters $\beta_0,\beta_1,\beta_2,\sigma$ is then given by (below we write "data" for $Y_1=y_1,\dots,Y_n=y_n$):

\[
f_{\beta_0,\beta_1,\beta_2,\sigma\mid\text{data}}(\beta_0,\beta_1,\beta_2,\sigma) \propto f_{Y_1,\dots,Y_n\mid\beta_0,\beta_1,\beta_2,\sigma}(y_1,\dots,y_n)\, f_{\beta_0,\beta_1,\beta_2,\sigma}(\beta_0,\beta_1,\beta_2,\sigma).
\]

The two terms on the right hand side above are the likelihood

\[
\begin{aligned}
f_{Y_1,\dots,Y_n\mid\beta_0,\beta_1,\beta_2,\sigma}(y_1,\dots,y_n)
&\propto \prod_{i=1}^n f_{Y_i\mid\beta_0,\beta_1,\beta_2,\sigma}(y_i)
= \prod_{i=1}^n f_{\epsilon_i\mid\beta_0,\beta_1,\beta_2,\sigma}\left(y_i-\beta_0-\beta_1\exp(-\beta_2 x_i)\right) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y_i-\beta_0-\beta_1\exp(-\beta_2 x_i))^2}{2\sigma^2}\right) \\
&\propto \sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\beta_0-\beta_1\exp(-\beta_2 x_i))^2\right),
\end{aligned}
\]

and the prior

\[
\begin{aligned}
f_{\beta_0,\beta_1,\beta_2,\sigma}(\beta_0,\beta_1,\beta_2,\sigma)
&= f_{\beta_0}(\beta_0) f_{\beta_1}(\beta_1) f_{\beta_2}(\beta_2) f_\sigma(\sigma) \\
&= \frac{I\{-C<\beta_0<C\}}{2C}\cdot\frac{I\{-C<\beta_1<C\}}{2C}\cdot\frac{I\{-C<\beta_2<C\}}{2C}\cdot\frac{I\{e^{-C}<\sigma<e^{C}\}}{2C\sigma} \\
&\propto \frac{1}{\sigma}\, I\{-C<\beta_0,\beta_1,\beta_2,\log\sigma<C\}.
\end{aligned}
\]
We thus obtain

\[
f_{\beta_0,\beta_1,\beta_2,\sigma\mid\text{data}}(\beta_0,\beta_1,\beta_2,\sigma) \propto \sigma^{-n-1}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\beta_0-\beta_1\exp(-\beta_2 x_i))^2\right) I\{-C<\beta_0,\beta_1,\beta_2,\log\sigma<C\}.
\]

Using the notation

\[
S(\beta_0,\beta_1,\beta_2) := \sum_{i=1}^n (y_i-\beta_0-\beta_1\exp(-\beta_2 x_i))^2
\]

for the sum of squares criterion, we can write the posterior as

\[
f_{\beta_0,\beta_1,\beta_2,\sigma\mid\text{data}}(\beta_0,\beta_1,\beta_2,\sigma) \propto \sigma^{-n-1}\exp\left(-\frac{S(\beta_0,\beta_1,\beta_2)}{2\sigma^2}\right) I\{-C<\beta_0,\beta_1,\beta_2,\log\sigma<C\}.
\]

Often our interest is only in the parameters $\beta_0,\beta_1,\beta_2$ ($\sigma$ is a nuisance parameter). To obtain the posterior of $\beta_0,\beta_1,\beta_2$, we integrate the full posterior above with respect to $\sigma$. Assuming that $C$ is large, we can do the integral from 0 to $\infty$, and this leads to (the calculation is the same as in the linear regression case)

\[
f_{\beta_0,\beta_1,\beta_2\mid\text{data}}(\beta_0,\beta_1,\beta_2) \propto \left(\frac{1}{S(\beta_0,\beta_1,\beta_2)}\right)^{n/2} I\{-C<\beta_0,\beta_1,\beta_2<C\}.
\]

This posterior density takes its largest value when $(\beta_0,\beta_1,\beta_2)$ minimizes the sum of squares $S(\beta_0,\beta_1,\beta_2)$. In other words, the maximum posterior density is achieved at the least squares estimator:

\[
(\hat\beta_0,\hat\beta_1,\hat\beta_2) = \text{minimizer of } S(\beta_0,\beta_1,\beta_2).
\]

Unlike in the linear regression case where we can write the least squares estimator in closed
form as (X T X)−1 X T Y , we may not be able to write the least squares estimator in this
nonlinear regression model in closed form. Nevertheless, generally there exists a unique least

squares estimator. The posterior distribution will assign non-negligible probability only to those parameter values $\beta_0,\beta_1,\beta_2$ for which $S(\beta_0,\beta_1,\beta_2)$ is close to the smallest possible value $S(\hat\beta_0,\hat\beta_1,\hat\beta_2)$. This can be seen, for example, by rewriting the posterior density as

\[
f_{\beta_0,\beta_1,\beta_2\mid\text{data}}(\beta_0,\beta_1,\beta_2) \propto \left(\frac{S(\hat\beta_0,\hat\beta_1,\hat\beta_2)}{S(\beta_0,\beta_1,\beta_2)}\right)^{n/2} I\{-C<\beta_0,\beta_1,\beta_2<C\}.
\]

For this reason, we can neglect the indicator function above (because the action is very close to the least squares estimator $\hat\beta_0,\hat\beta_1,\hat\beta_2$) and write

\[
f_{\beta_0,\beta_1,\beta_2\mid\text{data}}(\beta_0,\beta_1,\beta_2) \propto \left(\frac{S(\hat\beta_0,\hat\beta_1,\hat\beta_2)}{S(\beta_0,\beta_1,\beta_2)}\right)^{n/2}.
\]

Unlike in the linear regression case, the right hand side above is not the (unnormalized) density of a multivariate t-distribution. However, we can approximate it by a multivariate t-distribution via a second-order Taylor expansion of $S(\beta_0,\beta_1,\beta_2)$ around the least squares estimator $\hat\beta_0,\hat\beta_1,\hat\beta_2$. This Taylor expansion is justified because the posterior density will usually be quite concentrated around $\hat\beta_0,\hat\beta_1,\hat\beta_2$. Writing $\beta$ for the vector $(\beta_0,\beta_1,\beta_2)$ and $\hat\beta$ for the vector $(\hat\beta_0,\hat\beta_1,\hat\beta_2)$, the Taylor expansion is

\[
S(\beta) \approx S(\hat\beta) + \left\langle \nabla S(\hat\beta),\ \beta-\hat\beta \right\rangle + \frac{1}{2}\left(\beta-\hat\beta\right)^T HS(\hat\beta)\left(\beta-\hat\beta\right)
\]
where $\nabla S(\hat\beta)$ and $HS(\hat\beta)$ are the gradient and Hessian of $S$ at $\hat\beta$. Because $\hat\beta$ minimizes $S(\beta)$, we have $\nabla S(\hat\beta)=0$ and so

\[
S(\beta) \approx S(\hat\beta) + \frac{1}{2}\left(\beta-\hat\beta\right)^T HS(\hat\beta)\left(\beta-\hat\beta\right).
\]
Thus the posterior density is approximated by

\[
f_{\beta_0,\beta_1,\beta_2\mid\text{data}}(\beta_0,\beta_1,\beta_2) \propto \left(\frac{S(\hat\beta)}{S(\hat\beta)+\frac{1}{2}(\beta-\hat\beta)^T HS(\hat\beta)(\beta-\hat\beta)}\right)^{n/2}.
\]

Writing $p=3$ for the dimension of the vector $\beta$, we get

\[
f_{\beta\mid\text{data}}(\beta) \propto \left(\frac{1}{1+\frac{1}{2S(\hat\beta)}(\beta-\hat\beta)^T HS(\hat\beta)(\beta-\hat\beta)}\right)^{n/2}
= \left(\frac{1}{1+\frac{1}{n-p}(\beta-\hat\beta)^T \frac{n-p}{S(\hat\beta)}\frac{HS(\hat\beta)}{2}(\beta-\hat\beta)}\right)^{\frac{p+(n-p)}{2}}.
\]
This is clearly a multivariate t-density. More specifically,

\[
\beta \mid \text{data} \sim t_{n-p,p}\left(\hat\beta,\ \frac{S(\hat\beta)}{n-p}\left(\frac{1}{2}HS(\hat\beta)\right)^{-1}\right). \tag{92}
\]

One can summarize this posterior distribution by simply reporting the point estimates $\hat\beta_0,\hat\beta_1,\hat\beta_2$ and their standard errors, which are the square roots of the diagonal entries of

\[
\frac{S(\hat\beta)}{n-p}\left(\frac{1}{2}HS(\hat\beta)\right)^{-1}.
\]

The earlier linear regression analysis is a special case of this analysis: we recover the earlier result by taking $S(\beta) = \|Y-X\beta\|^2$ (in this case $HS(\beta) = 2X^TX$).
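In practice, $\hat\beta$ and $HS(\hat\beta)$ are computed numerically. Below is a Python sketch on simulated data (the data-generating values are assumptions), using scipy for the minimization and a central finite-difference approximation for the Hessian:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 5, size=n)
y = 1.0 + 2.0 * np.exp(-1.5 * x) + rng.normal(0, 0.1, size=n)

def S(b):  # least squares criterion S(beta_0, beta_1, beta_2)
    return np.sum((y - b[0] - b[1] * np.exp(-b[2] * x)) ** 2)

res = minimize(S, x0=np.array([0.5, 1.0, 1.0]), method="Nelder-Mead")
b_hat, S_hat, p = res.x, res.fun, 3

# Central finite-difference Hessian of S at b_hat.
h = 1e-3
H = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        ei, ej = np.eye(p)[i] * h, np.eye(p)[j] * h
        H[i, j] = (S(b_hat + ei + ej) - S(b_hat + ei - ej)
                   - S(b_hat - ei + ej) + S(b_hat - ei - ej)) / (4 * h ** 2)

Sigma = (S_hat / (n - p)) * np.linalg.inv(H / 2)  # covariance parameter from (92)
print(b_hat)                    # point estimates
print(np.sqrt(np.diag(Sigma)))  # standard errors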

It should be noted that the result (92) is an approximation (in other words, the exact posterior is not t-distributed) obtained by the second-order Taylor expansion of $S(\beta)$ around $\hat\beta$. An exact analysis of the posterior can be done in the following way. In this specific problem, there are three $\beta$ parameters: $\beta_0,\beta_1,\beta_2$. The model is linear in $\beta_0,\beta_1$ for every fixed $\beta_2$. This means that the conditional posterior of $\beta_0,\beta_1$ for fixed $\beta_2$ is exactly t-distributed, and so the marginal posterior density of $\beta_2$ can be calculated exactly. We shall do this analysis in more generality in the next section.

19.3 More on Nonlinear Regression Models

Consider the nonlinear regression model written in vector-matrix notation as

\[
Y = X(\omega)\beta + \epsilon \tag{93}
\]

where $Y$ is $n\times 1$, $\omega$ is a $k\times 1$ vector of unknown parameters, $X(\omega)$ is an $n\times p$ matrix that depends in a known way on the unknown parameters in $\omega$, $\beta$ is a $p\times 1$ vector of unknown parameters, and $\epsilon$ is an $n\times 1$ vector consisting of i.i.d. $N(0,\sigma^2)$ errors. This model has $k+p+1$ parameters: the $k$ elements of $\omega$, the $p$ elements of $\beta$, and $\sigma$. The model depends linearly on the $\beta$ parameters but possibly nonlinearly on the $\omega$ parameters. This setting includes the following special cases.

1. In the concrete example considered in the previous section, we can take ω = (β2 ) and
β = (β0 , β1 ). The matrix X(ω) is given by
 
\[
X(\omega) = X(\beta_2) = \begin{pmatrix} 1 & \exp(-\beta_2 x_1) \\ 1 & \exp(-\beta_2 x_2) \\ \vdots & \vdots \\ 1 & \exp(-\beta_2 x_n) \end{pmatrix}
\]

2. Consider the model

Yi = β0 + β1 cos(2πf xi ) + β2 sin(2πf xi ) + ϵi .

This is a nonlinear regression model where we are modeling the response as a sinusoidal
function of the explanatory variable where the sinusoid has an unknown frequency f .
This is a special case of (93) with $\omega=f$, $\beta$ the vector with components $\beta_0,\beta_1,\beta_2$,
and the X(ω) matrix is
 
\[
X(\omega) = X(f) = \begin{pmatrix} 1 & \cos(2\pi f x_1) & \sin(2\pi f x_1) \\ 1 & \cos(2\pi f x_2) & \sin(2\pi f x_2) \\ \vdots & \vdots & \vdots \\ 1 & \cos(2\pi f x_n) & \sin(2\pi f x_n) \end{pmatrix}
\]

3. Consider the model


Yi = β0 + β1 I{xi > ω} + ϵi

This is a changepoint-in-mean model where the response variable has mean $\beta_0$ when $x$ is at most $\omega$ and mean $\beta_0+\beta_1$ when $x$ exceeds $\omega$. This is also a special case of (93) with

\[
X(\omega) = \begin{pmatrix} 1 & I\{x_1>\omega\} \\ 1 & I\{x_2>\omega\} \\ \vdots & \vdots \\ 1 & I\{x_n>\omega\} \end{pmatrix}
\]

4. Consider the model


Yi = β0 + β1 xi + β2 (xi − ω)+ + ϵi
This is a broken-stick regression model where the regression line has slope $\beta_1$ when the covariate is at most $\omega$ and slope $\beta_1+\beta_2$ when the covariate exceeds $\omega$. Here $(x-\omega)_+ = \max(x-\omega,0)$. This is also a special case of (93) with

\[
X(\omega) = \begin{pmatrix} 1 & x_1 & (x_1-\omega)_+ \\ 1 & x_2 & (x_2-\omega)_+ \\ \vdots & \vdots & \vdots \\ 1 & x_n & (x_n-\omega)_+ \end{pmatrix}
\]

The likelihood of the model (93) is

\[
\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left(-\frac{\|Y-X(\omega)\beta\|^2}{2\sigma^2}\right).
\]

To perform Bayesian analysis in the model (93), we assume as before that all the components of $\omega$, all the components of $\beta$, and $\log\sigma$ are i.i.d. uniformly distributed on $(-C,C)$ for a large $C$. The posterior density is then given by

\[
f_{\omega,\beta,\sigma\mid\text{data}}(\omega,\beta,\sigma) \propto \sigma^{-n-1}\exp\left(-\frac{\|Y-X(\omega)\beta\|^2}{2\sigma^2}\right),
\]

where we have ignored the indicator functions (assuming $C$ is large). Often the main interest is in the $\omega$ parameters, so we integrate the posterior with respect to $\beta$ and $\sigma$. This gives

\[
f_{\omega\mid\text{data}}(\omega) \propto \int_0^{\infty}\int_{\mathbb{R}^p} \sigma^{-n-1}\exp\left(-\frac{\|Y-X(\omega)\beta\|^2}{2\sigma^2}\right) d\beta\, d\sigma. \tag{94}
\]

Now if $\hat\beta(\omega)$ is the least squares estimator for fixed $\omega$:

\[
\hat\beta(\omega) := \underset{\beta}{\mathrm{argmin}}\ \|Y-X(\omega)\beta\|^2,
\]

then using

\[
\|Y-X(\omega)\beta\|^2 = \|Y-X(\omega)\hat\beta(\omega)\|^2 + \|X(\omega)\beta-X(\omega)\hat\beta(\omega)\|^2
= \|Y-X(\omega)\hat\beta(\omega)\|^2 + \left(\beta-\hat\beta(\omega)\right)^T X(\omega)^T X(\omega)\left(\beta-\hat\beta(\omega)\right),
\]

the integral (94) then becomes

\[
\begin{aligned}
&\int_0^{\infty}\int_{\mathbb{R}^p} \sigma^{-n-1}\exp\left(-\frac{\|Y-X(\omega)\hat\beta(\omega)\|^2}{2\sigma^2}\right)\exp\left(-\frac{(\beta-\hat\beta(\omega))^T X(\omega)^T X(\omega)(\beta-\hat\beta(\omega))}{2\sigma^2}\right) d\beta\, d\sigma \\
&= \int_0^{\infty} \sigma^{-n-1}\exp\left(-\frac{\|Y-X(\omega)\hat\beta(\omega)\|^2}{2\sigma^2}\right)\left[\int_{\mathbb{R}^p}\exp\left(-\frac{(\beta-\hat\beta(\omega))^T X(\omega)^T X(\omega)(\beta-\hat\beta(\omega))}{2\sigma^2}\right) d\beta\right] d\sigma.
\end{aligned}
\]

We shall now use the formula

\[
\int_{\mathbb{R}^p}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) dx_1\dots dx_p = (2\pi)^{p/2}\sqrt{\det(\Sigma)},
\]
where Σ is a p × p positive definite matrix and the integral is over x = (x1 , . . . , xp ). This is
basically the formula for the normalizing constant for the multivariate normal distribution.

This formula with $\Sigma^{-1} = X(\omega)^T X(\omega)/\sigma^2$ (or equivalently $\Sigma = \sigma^2(X(\omega)^T X(\omega))^{-1}$) gives

\[
\int_{\mathbb{R}^p}\exp\left(-\frac{(\beta-\hat\beta(\omega))^T X(\omega)^T X(\omega)(\beta-\hat\beta(\omega))}{2\sigma^2}\right) d\beta
= (2\pi)^{p/2}\sqrt{\det\left(\sigma^2(X(\omega)^T X(\omega))^{-1}\right)}
= (2\pi)^{p/2}\,\sigma^p\left(\det(X(\omega)^T X(\omega))\right)^{-1/2}.
\]
The required integral is therefore

\[
(2\pi)^{p/2}\left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\int_0^{\infty} \sigma^{-n+p-1}\exp\left(-\frac{\|Y-X(\omega)\hat\beta(\omega)\|^2}{2\sigma^2}\right) d\sigma.
\]

The change of variable

\[
t = \frac{\sigma}{\|Y-X(\omega)\hat\beta(\omega)\|}
\]

then gives

\[
\begin{aligned}
&(2\pi)^{p/2}\left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\int_0^{\infty}\sigma^{-n+p-1}\exp\left(-\frac{\|Y-X(\omega)\hat\beta(\omega)\|^2}{2\sigma^2}\right) d\sigma \\
&= (2\pi)^{p/2}\left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\|Y-X(\omega)\hat\beta(\omega)\|^{-n+p}\int_0^{\infty} t^{-n+p-1}\exp\left(-\frac{1}{2t^2}\right) dt \\
&\propto \left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\|Y-X(\omega)\hat\beta(\omega)\|^{-n+p}.
\end{aligned}
\]
Putting everything together, we have proved that

\[
f_{\omega\mid\text{data}}(\omega) \propto \left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\,\|Y-X(\omega)\hat\beta(\omega)\|^{-(n-p)}.
\]
This function of ω can be numerically understood when the dimension of ω is low. For
example, if ω is one-dimensional, we can plot the posterior density on the computer and
normalize it so the density integrates to one.
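For instance, for the changepoint model of special case 3 above, this posterior can be evaluated on a grid over ω. A Python sketch on simulated data (the true changepoint, noise level and grid choices are assumptions); slogdet is used for numerical stability.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 10, size=n))
y = np.where(x > 3.0, 2.0, 0.0) + rng.normal(0, 1, size=n)  # assumed change at 3

def log_post(omega, p=2):
    X = np.column_stack([np.ones(n), (x > omega).astype(float)])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    _, logdet = np.linalg.slogdet(X.T @ X)
    return -0.5 * logdet - (n - p) * np.log(np.linalg.norm(resid))

grid = np.linspace(x[2], x[-3], 500)  # keep X(omega) of full rank
lp = np.array([log_post(w) for w in grid])
post = np.exp(lp - lp.max())
post /= np.trapz(post, grid)          # normalize so the density integrates to one
print(grid[np.argmax(post)])          # posterior mode, close to 3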

20 Lecture Twenty

20.1 Last Class: Nonlinear Regression Models with both linear and nonlinear
parameter dependence

Consider the nonlinear regression model written in vector-matrix notation as

\[
Y = X(\omega)\beta + \epsilon \tag{95}
\]

where $Y$ is $n\times 1$, $\omega$ is a $k\times 1$ vector of unknown parameters, $X(\omega)$ is an $n\times p$ matrix that depends in a known way on the unknown parameters in $\omega$, $\beta$ is a $p\times 1$ vector of unknown parameters, and $\epsilon$ is an $n\times 1$ vector consisting of i.i.d. $N(0,\sigma^2)$ errors. This model has $k+p+1$ parameters: the $k$ elements of $\omega$, the $p$ elements of $\beta$, and $\sigma$. The model depends linearly on the $\beta$ parameters but possibly nonlinearly on the $\omega$ parameters. This setting includes the following special cases.

1. In the concrete example considered in the previous section, we can take ω = (β2 ) and
β = (β0 , β1 ). The matrix X(ω) is given by
 
\[
X(\omega) = X(\beta_2) = \begin{pmatrix} 1 & \exp(-\beta_2 x_1) \\ 1 & \exp(-\beta_2 x_2) \\ \vdots & \vdots \\ 1 & \exp(-\beta_2 x_n) \end{pmatrix}
\]

2. Consider the model

Yi = β0 + β1 cos(2πf xi ) + β2 sin(2πf xi ) + ϵi .

This is a nonlinear regression model where we are modeling the response as a sinusoidal
function of the explanatory variable where the sinusoid has an unknown frequency f .
This is a special case of (95) with ω = f , β is the vector with components β0 , β1 , β2
and the X(ω) matrix is
 
\[
X(\omega) = X(f) = \begin{pmatrix} 1 & \cos(2\pi f x_1) & \sin(2\pi f x_1) \\ 1 & \cos(2\pi f x_2) & \sin(2\pi f x_2) \\ \vdots & \vdots & \vdots \\ 1 & \cos(2\pi f x_n) & \sin(2\pi f x_n) \end{pmatrix}
\]

3. Consider the model


Yi = β0 + β1 I{xi > ω} + ϵi
This is a changepoint-in-mean model where the response variable has mean $\beta_0$ when $x$ is at most $\omega$ and mean $\beta_0+\beta_1$ when $x$ exceeds $\omega$. This is also a special case of (95) with

\[
X(\omega) = \begin{pmatrix} 1 & I\{x_1>\omega\} \\ 1 & I\{x_2>\omega\} \\ \vdots & \vdots \\ 1 & I\{x_n>\omega\} \end{pmatrix}
\]

4. Consider the model


\[
Y_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i-\omega)_+ + \epsilon_i
\]

This is a broken-stick regression model where the regression line has slope $\beta_1$ when the covariate is at most $\omega$ and slope $\beta_1+\beta_2$ when the covariate exceeds $\omega$. Here $(x-\omega)_+ = \max(x-\omega,0)$. This is also a special case of (95) with

\[
X(\omega) = \begin{pmatrix} 1 & x_1 & (x_1-\omega)_+ \\ 1 & x_2 & (x_2-\omega)_+ \\ \vdots & \vdots & \vdots \\ 1 & x_n & (x_n-\omega)_+ \end{pmatrix}
\]

The likelihood of the model (95) is

\[
\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left(-\frac{\|Y-X(\omega)\beta\|^2}{2\sigma^2}\right).
\]
To perform Bayesian analysis in the model (95), we assume as before that all the components of $\omega$, all the components of $\beta$, and $\log\sigma$ are i.i.d. uniformly distributed on $(-C,C)$ for a large $C$. The posterior density is then given by

\[
f_{\omega,\beta,\sigma\mid\text{data}}(\omega,\beta,\sigma) \propto \sigma^{-n-1}\exp\left(-\frac{\|Y-X(\omega)\beta\|^2}{2\sigma^2}\right),
\]
where we have ignored the indicator functions (assuming $C$ is large). Often the main interest is in the $\omega$ parameters, so we integrate the posterior with respect to $\beta$ and $\sigma$. This gives

\[
f_{\omega\mid\text{data}}(\omega) \propto \int_0^{\infty}\int_{\mathbb{R}^p} \sigma^{-n-1}\exp\left(-\frac{\|Y-X(\omega)\beta\|^2}{2\sigma^2}\right) d\beta\, d\sigma. \tag{96}
\]

Now if $\hat\beta(\omega)$ is the least squares estimator for fixed $\omega$:

\[
\hat\beta(\omega) := \underset{\beta}{\mathrm{argmin}}\ \|Y-X(\omega)\beta\|^2,
\]

then using

\[
\|Y-X(\omega)\beta\|^2 = \|Y-X(\omega)\hat\beta(\omega)\|^2 + \|X(\omega)\beta-X(\omega)\hat\beta(\omega)\|^2
= \|Y-X(\omega)\hat\beta(\omega)\|^2 + \left(\beta-\hat\beta(\omega)\right)^T X(\omega)^T X(\omega)\left(\beta-\hat\beta(\omega)\right),
\]

the integral (96) then becomes

\[
\begin{aligned}
&\int_0^{\infty}\int_{\mathbb{R}^p} \sigma^{-n-1}\exp\left(-\frac{\|Y-X(\omega)\hat\beta(\omega)\|^2}{2\sigma^2}\right)\exp\left(-\frac{(\beta-\hat\beta(\omega))^T X(\omega)^T X(\omega)(\beta-\hat\beta(\omega))}{2\sigma^2}\right) d\beta\, d\sigma \\
&= \int_0^{\infty} \sigma^{-n-1}\exp\left(-\frac{\|Y-X(\omega)\hat\beta(\omega)\|^2}{2\sigma^2}\right)\left[\int_{\mathbb{R}^p}\exp\left(-\frac{(\beta-\hat\beta(\omega))^T X(\omega)^T X(\omega)(\beta-\hat\beta(\omega))}{2\sigma^2}\right) d\beta\right] d\sigma.
\end{aligned}
\]

We shall now use the formula

\[
\int_{\mathbb{R}^p}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) dx_1\dots dx_p = (2\pi)^{p/2}\sqrt{\det(\Sigma)},
\]
where Σ is a p × p positive definite matrix and the integral is over x = (x1 , . . . , xp ). This is
basically the formula for the normalizing constant for the multivariate normal distribution.

This formula with $\Sigma^{-1} = X(\omega)^T X(\omega)/\sigma^2$ (or equivalently $\Sigma = \sigma^2(X(\omega)^T X(\omega))^{-1}$) gives

\[
\int_{\mathbb{R}^p}\exp\left(-\frac{(\beta-\hat\beta(\omega))^T X(\omega)^T X(\omega)(\beta-\hat\beta(\omega))}{2\sigma^2}\right) d\beta
= (2\pi)^{p/2}\sqrt{\det\left(\sigma^2(X(\omega)^T X(\omega))^{-1}\right)}
= (2\pi)^{p/2}\,\sigma^p\left(\det(X(\omega)^T X(\omega))\right)^{-1/2}.
\]

The required integral is therefore

\[
(2\pi)^{p/2}\left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\int_0^{\infty} \sigma^{-n+p-1}\exp\left(-\frac{\|Y-X(\omega)\hat\beta(\omega)\|^2}{2\sigma^2}\right) d\sigma.
\]

The change of variable

\[
t = \frac{\sigma}{\|Y-X(\omega)\hat\beta(\omega)\|}
\]

then gives

\[
\begin{aligned}
&(2\pi)^{p/2}\left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\int_0^{\infty}\sigma^{-n+p-1}\exp\left(-\frac{\|Y-X(\omega)\hat\beta(\omega)\|^2}{2\sigma^2}\right) d\sigma \\
&= (2\pi)^{p/2}\left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\|Y-X(\omega)\hat\beta(\omega)\|^{-n+p}\int_0^{\infty} t^{-n+p-1}\exp\left(-\frac{1}{2t^2}\right) dt \\
&\propto \left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\|Y-X(\omega)\hat\beta(\omega)\|^{-n+p}.
\end{aligned}
\]

Putting everything together, we have proved that

\[
f_{\omega\mid\text{data}}(\omega) \propto \left(\det(X(\omega)^T X(\omega))\right)^{-1/2}\,\|Y-X(\omega)\hat\beta(\omega)\|^{-(n-p)}.
\]

This function of ω can be numerically analyzed when the dimension of ω is low. For example,
if ω is one-dimensional, we can plot the posterior density on the computer and normalize it
so the density integrates to one.

20.2 Logistic Regression

Here is another regression model which can be handled in a straightforward fashion by


probability theory. We are again in the usual regression setting where we observe data
(yi , xi1 , xi2 , . . . , xim ) for i = 1, . . . , n. There are m explanatory variables x1 , . . . , xm and one
response variable. xij denotes the value of the j th explanatory variable for the ith individual
and yi is the value of the response variable for the ith individual. Suppose now that the
response variable is binary i.e., y1 , . . . , yn take values in {0, 1}. In this case, the logistic
regression model assumes that:
 
Yi ∼ Bernoulli( exp(β0 + β1 xi1 + · · · + βm xim) / (1 + exp(β0 + β1 xi1 + · · · + βm xim)) ), independently for i = 1, . . . , n.

Letting xi = (1, xi1 , . . . , xim )T and β = (β0 , β1 , . . . , βm )T , we can write the model also as
Yi ∼ Bernoulli( exp(xi^T β) / (1 + exp(xi^T β)) ), independently for i = 1, . . . , n.


Observe that xT1 , . . . , xTn form the rows of the n × p design matrix X (where p = m + 1). The
unknown parameters in the logistic regression model are β0 , . . . , βm (note that, in contrast
to the linear regression model, there is no σ parameter in logistic regression). In order to
use probability theory for inference on β0 , . . . , βm , we assume the prior:
β1, . . . , βp ∼ i.i.d. Unif(−C, C)    (97)

for a large C. The posterior of β is then

fβ|Y =y (β) ∝ fY |β (y)fβ (β)


∝ [ ∏_{i=1}^n ( exp(xi^T β)/(1 + exp(xi^T β)) )^yi ( 1 − exp(xi^T β)/(1 + exp(xi^T β)) )^(1−yi) ] I{β1, . . . , βp ∈ (−C, C)}

= [ ∏_{i=1}^n exp(yi xi^T β)/(1 + exp(xi^T β)) ] I{β1, . . . , βp ∈ (−C, C)}

= [exp(ℓ(β))] I{β1, . . . , βp ∈ (−C, C)}

where

ℓ(β) := ∑_{i=1}^n [ yi (xi^T β) − log(1 + exp(xi^T β)) ].

Note that ℓ(β) is simply the log-likelihood in this problem. The posterior density is not
in standard form. If p = 1 or p = 2, then this can be plotted. One can use various
MCMC techniques to obtain samples from this posterior. A closed form multivariate normal
approximation that works quite well in practice will be described in the next class. Bayesian
inference from this normal approximation to the posterior coincides with usual frequentist
inference for logistic regression.

21 Lecture Twenty One

21.1 Logistic Regression

Here is another regression model which can be handled in a straightforward fashion by


probability theory. We are again in the usual regression setting where we observe data
(yi , xi1 , xi2 , . . . , xim ) for i = 1, . . . , n. There are m explanatory variables x1 , . . . , xm and one
response variable. xij denotes the value of the j th explanatory variable for the ith individual
and yi is the value of the response variable for the ith individual. Suppose now that the
response variable is binary i.e., y1 , . . . , yn take values in {0, 1}. In this case, the logistic
regression model assumes that:
 
Yi ∼ Bernoulli( exp(β0 + β1 xi1 + · · · + βm xim) / (1 + exp(β0 + β1 xi1 + · · · + βm xim)) ), independently for i = 1, . . . , n.

Letting xi = (1, xi1 , . . . , xim )T and β = (β0 , β1 , . . . , βm )T , we can write the model also as
Yi ∼ Bernoulli( exp(xi^T β) / (1 + exp(xi^T β)) ), independently for i = 1, . . . , n


where β is the (m + 1) × 1 vector with components β0 , β1 , . . . , βm .

Suppose xT1 , . . . , xTn form the rows of the n × p design matrix X (where p = m + 1). The
unknown parameters in the logistic regression model are β0 , . . . , βm (note that, in contrast
to the linear regression model, there is no σ parameter in logistic regression). In order to
use probability theory for inference on β0 , . . . , βm , we assume the prior:
β0, β1, . . . , βm ∼ i.i.d. Unif(−C, C)    (98)

for a large C. The posterior of β is then

fβ|Y =y (β) ∝ fY |β (y)fβ (β)


∝ [ ∏_{i=1}^n ( exp(xi^T β)/(1 + exp(xi^T β)) )^yi ( 1 − exp(xi^T β)/(1 + exp(xi^T β)) )^(1−yi) ] I{β0, β1, . . . , βm ∈ (−C, C)}

= [ ∏_{i=1}^n exp(yi xi^T β)/(1 + exp(xi^T β)) ] I{β0, β1, . . . , βm ∈ (−C, C)}

= [exp(ℓ(β))] I{β0, β1, . . . , βm ∈ (−C, C)}

where

ℓ(β) := ∑_{i=1}^n [ yi (xi^T β) − log(1 + exp(xi^T β)) ].

Note that ℓ(β) is simply the log-likelihood in this problem. The posterior density is not
in standard form. If p = 1 or p = 2, then this can be plotted. One can use various
MCMC techniques to obtain samples from this posterior. A closed form multivariate normal
approximation that works quite well in practice can be found as follows. Bayesian inference
from this normal approximation to the posterior coincides with usual frequentist inference
for logistic regression. To get the normal approximation, first let us drop the indicator which
will be irrelevant when C is large to get

fβ|Y =y (β) ∝ exp(ℓ(β)).

The normal approximation will be obtained by a second-order Taylor expansion of ℓ(β). We


shall do the Taylor expansion around the MLE β̂ because the posterior is peaked at β̂ and
the high-probability regions of the posterior are very close to β̂. Recall that the MLE β̂ is
defined as the maximizer of the likelihood (or log-likelihood):

β̂ := argmax_{β∈R^p} ℓ(β).

It is obtained by taking the gradient of the log-likelihood and solving the equation obtained
by setting the gradient to zero. It is easy to check that the gradient of the log-likelihood is
∇ℓ(β) = ∑_{i=1}^n ( yi − exp(xi^T β)/(1 + exp(xi^T β)) ) xi.    (99)

To get the maximum likelihood estimator β̂, we need to set the gradient above to zero and
solve the resulting equation for β. This cannot be done in closed form and the usual method
is to use an iterative scheme such as Newton’s algorithm. The answer can be obtained from
inbuilt functions in R or Python. More details behind the Newton algorithm will be provided
a bit later.

Coming back to the posterior exp(ℓ(β)), Taylor expansion of ℓ(β) around the MLE β̂ gives
fβ|Y=y(β) ∝ exp(ℓ(β)) ≈ exp( ℓ(β̂) + ⟨∇ℓ(β̂), β − β̂⟩ + (1/2)(β − β̂)^T Hℓ(β̂)(β − β̂) )

where Hℓ(β) denotes the Hessian of ℓ(β):


Hℓ(β) = −∑_{i=1}^n [ exp(xi^T β)/(1 + exp(xi^T β))² ] xi xi^T.

Because the ℓ(β̂) term is a constant, it can be ignored in proportionality. Also ∇ℓ(β̂) equals
zero. We thus have
   
fβ|Y=y(β) ∝ exp( (1/2)(β − β̂)^T Hℓ(β̂)(β − β̂) ) = exp( −(1/2)(β − β̂)^T (−Hℓ(β̂)) (β − β̂) )

We have switched above to −Hℓ(β̂) because this matrix is positive semi-definite as β̂ maxi-
mizes ℓ(β). The above term is simply the unnormalized density of the multivariate normal
distribution with mean β̂ and covariance matrix (−Hℓ(β̂))^(−1). Observe that
−Hℓ(β̂) = ∑_{i=1}^n [ exp(xi^T β̂)/(1 + exp(xi^T β̂))² ] xi xi^T.

Now let W denote the n × n diagonal matrix whose ith diagonal entry is

exp(xi^T β̂) / (1 + exp(xi^T β̂))²

and also recall again that X is the n × p matrix with rows x′1 , . . . , x′n . It is then easy to check
that
−Hℓ(β̂) = X ′ W X. (100)
The posterior normal approximation is thus

N (β̂, (X ′ W X)−1 ). (101)

The standard errors corresponding to β0 , . . . , βm can then be obtained by the square roots
of the diagonal entries of (X ′ W X)−1 .

It turns out that Bayesian inference done with the above normal posterior approximation
(101) coincides with the frequentist inference in the logistic regression model. It is easy to
check this, say, in R (construct a 95% credible interval for, say, one of the components of
β and then compare it with the frequentist interval). Thus the usual frequentist inference
for the logistic regression model can be viewed from a Bayesian perspective. Note that the
above analysis relies on two assumptions: (a) the prior for β is assumed to be uniform on the
large cube (−C, C)p , and (b) the posterior is approximated by a normal distribution. These
assumptions may be of course not reasonable in a particular application. In such a situation,
it is conceptually very clear as to how one would proceed: if the normal approximation to the
posterior is not accurate, one needs to work with the actual posterior. If the uniform prior is
not reasonable, one can do the full posterior analysis (or by taking a normal approximation
to the posterior) for a more appropriate prior.
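Here is a minimal Python sketch of this check on simulated data (the coefficient values are hypothetical, and the statsmodels package is used only as a convenient source of the frequentist fit):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))   # n x 3 design matrix with intercept
beta_true = np.array([-0.5, 1.0, -2.0])        # hypothetical true coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

res = sm.Logit(y, X).fit(disp=0)               # frequentist fit (the MLE)
beta_hat = res.params

# Normal posterior approximation (101): N(beta_hat, (X'WX)^{-1})
w = np.exp(X @ beta_hat) / (1 + np.exp(X @ beta_hat)) ** 2
se = np.sqrt(np.diag(np.linalg.inv(X.T @ (w[:, None] * X))))

print("credible interval :", beta_hat[1] - 1.96 * se[1], beta_hat[1] + 1.96 * se[1])
print("frequentist       :", res.conf_int()[1])   # should essentially coincide
```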

21.2 Details behind the Newton Algorithm for computing the MLE

The MLE β̂ of β is the maximizer of ℓ(β). The maximizer of ℓ(β) cannot be computed in
closed form. Newton’s method is commonly used for maximizing ℓ(β). Newton’s method
uses the iterative scheme
β^(m+1) = β^(m) − ( Hℓ(β^(m)) )^(−1) ∇ℓ(β^(m))    (102)

As we saw in (99),


∇ℓ(β) = ∑_{i=1}^n (yi − πi) xi

where πi is given by
πi = exp(xi^T β) / (1 + exp(xi^T β)).
Letting π be the n × 1 vector with entries π1 , . . . , πn , we can write ∇ℓ(β) in matrix notation
as (note that X has rows xT1 , . . . , xTn or, equivalently, X T has columns x1 , . . . , xn ):

∇ℓ(β) = X T (Y − π)

where, as usual in regression, Y denotes the n × 1 vector of response values. Also from (100),
we can write
Hℓ(β) = −X T W X
where W is the n × n diagonal matrix whose ith diagonal entry is πi (1 − πi ). Newton’s
iterative scheme (102) therefore becomes

β (m+1) = β (m) + (X T W X)−1 X T (Y − π).

This can be rewritten as


β (m+1) = (X T W X)−1 X T W Z (103)
where
Z = Xβ (m) + W −1 (Y − π). (104)
The method of obtaining the MLE β̂ therefore proceeds iteratively as follows. First have an
initial estimate of β. Call this initial estimator β̂^(0). Use this estimator to calculate πi via

πi = exp( β̂0^(0) + β̂1^(0) xi1 + · · · + β̂m^(0) xim ) / ( 1 + exp( β̂0^(0) + β̂1^(0) xi1 + · · · + β̂m^(0) xim ) ).

Use these values of πi to create the response variable values Zi via (104) and also use values
of πi to construct the matrix W . With Z and W , we can estimate β via

β̂ (1) = (X T W X)−1 X T W Z.

Now replace the initial estimator β̂ (0) by β̂ (1) and repeat this process. Keep repeating this
until two successive estimates β̂ (m) and β̂ (m+1) do not change much. At that point, stop and
report the estimate of β in the logistic regression model as β̂ (m) .

The expression (X T W X)−1 X T W Z is reminiscent of the usual (X T X)−1 X T Y which is


the usual estimate of β in the linear model. In fact, it is the weighted least squares estimate
with weight matrix W.
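Here is a minimal Python sketch of this iterative scheme on simulated data; it implements (103)–(104) directly and reads off the standard errors from (X^T W X)^(−1). The data and true coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(X.shape[1])                    # initial estimate beta^(0)
for _ in range(25):
    pi = 1 / (1 + np.exp(-X @ beta))           # pi_i = exp(x_i'beta)/(1+exp(x_i'beta))
    w = pi * (1 - pi)                          # diagonal entries of W
    z = X @ beta + (y - pi) / w                # working response Z from (104)
    beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))   # (103)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

se = np.sqrt(np.diag(np.linalg.inv(X.T @ (w[:, None] * X))))
print("MLE:", beta)
print("standard errors:", se)
```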

22 Lecture Twenty Two

22.1 Linear Regression Recap

So far we have studied the linear regression model:


Y = Xβ + ϵ with ϵ1, . . . , ϵn ∼ i.i.d. N(0, σ²)

under the prior that the components of β are i.i.d Unif(−C, C) for a large C. We have seen
(Problem 1(a) in Homework Five) that
 
β | data, σ ∼ Np( β̂, σ²(X^T X)^(−1) )    (105)

where p is the dimension of β. This is the posterior distribution of β conditional on σ. This
cannot be used for inference on β because σ is unknown. The posterior of β (without any
conditioning on σ) is given by (under the prior log σ ∼ Unif(−C, C))
 
β | data ∼ tn−p,p( β̂, σ̂²(X^T X)^(−1) )    (106)

where σ̂ is the residual standard error. When n − p is large, the t-distribution will be quite
close to normal, so we can write
 
β | data is approximately Np( β̂, σ̂²(X^T X)^(−1) ).    (107)

Inference on β is done either using (106) or (107).

22.2 Linear Regression with Gaussian prior

It is common to consider other priors for linear regression. A general class of priors is given
by
β ∼ Np (m0 , Q0 ) (108)
for a mean vector m0 and covariance matrix Q0 . In this case, it can be shown (left as
exercise) that the posterior distribution of β conditional on σ is given by

β | data, σ ∼ Np (m1 , Q1 ) (109)

where
m1 = ( Q0^(−1) + X^T X/σ² )^(−1) ( Q0^(−1) m0 + X^T Y/σ² )  and  Q1 = ( Q0^(−1) + X^T X/σ² )^(−1).

(109) is the analogue of (105) for the Gaussian prior (108). Actually, the result (105) can
be seen as a special case of (109) when the prior covariance Q0 becomes large (think of
the setting where the smallest eigenvalue of Q0 goes to ∞). This is because, when Q0
approaches infinity, it is easy to see that
m1 = ( Q0^(−1) + X^T X/σ² )^(−1) ( Q0^(−1) m0 + X^T Y/σ² ) → ( X^T X/σ² )^(−1) ( X^T Y/σ² ) = (X^T X)^(−1) X^T Y = β̂

and also

Q1 = ( Q0^(−1) + X^T X/σ² )^(−1) → σ²(X^T X)^(−1).
As one concrete example of Q0 being large, think of Q0 = CIp when C is large. The posterior
for this prior is the same as (105) (i.e., there is no difference between the Unif(−C, C) and
Np (0, CIp ) priors) . The Gaussian prior result (109) can thus be seen as a generalization of
(105).

In many applications, it makes sense to work with the prior (108) as opposed to the
Unif(−C, C) prior. We shall illustrate this in a real data setting in the next section.
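Here is a minimal Python sketch of computing the posterior (109) on simulated data (the choices of m0, Q0 and σ are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])
sigma = 0.5
Y = X @ beta_true + rng.normal(0, sigma, n)

m0 = np.zeros(p)
Q0 = 10.0 * np.eye(p)                                        # prior covariance

# Posterior (109): beta | data, sigma ~ N_p(m1, Q1)
Q1 = np.linalg.inv(np.linalg.inv(Q0) + X.T @ X / sigma**2)   # posterior covariance
m1 = Q1 @ (np.linalg.inv(Q0) @ m0 + X.T @ Y / sigma**2)      # posterior mean
print("posterior mean:", m1)
```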

22.3 Linear Regression on an Earnings Dataset

Consider the dataset ex1029 from the R package Sleuth3 (this package is written by Ramsey
and Schafer to accompany their introductory statistics book Statistical Sleuth). This dataset
contains weekly wages in 1987 for a sample of n = 25682 males between the ages of 18 and
70 who worked full-time, along with some covariates including their years of experience. We shall
work with the two variables:

y = response variable = log(weekly earnings)

and
x = years of experience.
We shall fit linear regression models of y on x. The reason for working with log(earnings) as
opposed to earnings directly is for better interpretation.

The most basic model between y and x is the usual linear model:

y = β0 + β1 x + ϵ.

From a visual examination of the scatterplot between y and x, it can be easily seen that this
simple linear regression model is not adequate as the relationship between y and x is clearly
nonlinear (y increases with x for small values of x and decreases with x for large values of x).
A more suitable model is
y = β0 + β1 x + β2 (x − s)+ + ϵ (110)
where s is also an unknown parameter. This model fits two lines which are connected at the
point s. The rate of change of y with x is β1 for x ≤ s and β1 + β2 for x > s. We have
previously seen how to fit models of the form (110) to data.

For this specific dataset, the model (110) is not suitable either because there is no reason
to just have one change of slope for the regression function. A more suitable model would
be
y = β0 + β1 x + β2 (x − s1 )+ + β3 (x − s2 )+ + · · · + βk+1 (x − sk )+ + ϵ
for a k that is not too small. There are two problems with working with this model:

1. It is difficult to fit it to the data unless k is very small. The methodology that we
studied in Lecture 20 involved marginalizing the β’s and σ to get a posterior only for
s1 , . . . , sk . This is a k-dimensional posterior and a grid-based method for selecting the
posterior mode or mean would be quite computationally challenging if k is not small.

2. It is difficult to select a suitable value of k.

One way to circumvent these problems is to introduce a change of slope term (x − s)+ at
every possible value of s. In this dataset, the variable x takes the values 0, 1, . . . , 63. So we
consider the model:

y = β0 + β1 x + β2 (x − 1)+ + β3 (x − 2)+ + · · · + β64 (x − 63)+ + ϵ (111)

This is the model that we shall work with. It is a linear regression model with 65 coefficients.
The intercept β0 is interpreted as the log(earnings) for someone who is starting their career
(x = 0). Also, for j = 1, . . . , 63, the term 100βj is interpreted as the percent change in
earnings when someone moves from (j −1) years of experience to j years in experience. [This
interpretation is actually incorrect; see the notes for the next lecture (Lecture
23) for the correct interpretation]

How do we fit the model (111) to the observed data? The first approach is to work with the
uniform prior βj ∼ Unif(−C, C) for j = 0, 1, . . . , 64. This is equivalent to just doing the
usual linear regression (using the R function lm for instance). This gives the least squares
estimates β̂0^ls, . . . , β̂64^ls, and the function that we use for explaining the relationship between
earnings and experience is y = f̂^ls(x) where

f̂^ls(x) := β̂0^ls + β̂1^ls x + β̂2^ls (x − 1)+ + β̂3^ls (x − 2)+ + · · · + β̂64^ls (x − 63)+.    (112)
In this particular dataset, this function turns out to be somewhat wiggly and not smooth,
which is not very interpretable. The situation is more pronounced when the number of
observations n is not large. One can reduce the size of this dataset to, say, n = 500 by
sampling 500 observations (rows) at random from this dataset. One can then refit the least
squares estimate of y on x, (x − 1)+ , . . . , (x − 63)+ to this smaller dataset. Here the fitted
function (112) will be much more wiggly.

In order to obtain a smooth function fit to the data, one can use the prior
β0 ∼ N(0, C) and β1, . . . , β64 ∼ i.i.d. N(0, τ²)

for a small τ . Here, if we take τ to be small, we are insisting on β1 , . . . , β64 to be small


which will lead to a smoother fit. The assumption β0 ∼ N (0, C) on β0 is very similar to
β0 ∼ Unif(−C, C) and it just says that we do not enforce anything on β0 a priori. We can
write this prior as
β ∼ N65 (m0 , Q0 )
where β is the 65 × 1 vector with components β0 , β1 , . . . , β64 , m0 is the 65 × 1 vector of zeros,
and Q0 is the 65 × 65 diagonal matrix with diagonal entries C, τ 2 , τ 2 , . . . , τ 2 . The posterior
distribution of β can then be calculated using (109) as:
β | data, σ ∼ N65( ( Q0^(−1) + X^T X/σ² )^(−1) X^T Y/σ², ( Q0^(−1) + X^T X/σ² )^(−1) ).

This posterior can be used for inference on β. Note that it depends on τ and σ. One can
take a small value for τ if smooth function fit is desired. For σ, one can take a prior such as
log σ ∼ Unif(−C, C) and calculate the marginal posterior of β given the data alone (another
method is described in the next section). The posterior mean is
β̃^τ = ( Q0^(−1) + X^T X/σ² )^(−1) X^T Y/σ²
which can be used to get the function fit:

f̃^τ(x) := β̃0^τ + β̃1^τ x + β̃2^τ (x − 1)+ + β̃3^τ (x − 2)+ + · · · + β̃64^τ (x − 63)+.

When τ is small, it can be checked that f̃^τ(x) will be a very smooth function of x. If τ is
extremely small, then f̃^τ(x) will be essentially a constant.
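Here is a minimal Python sketch of computing β̃^τ, with simulated data standing in for the earnings dataset (the simulated truth and the values of τ and σ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.integers(0, 64, n).astype(float)                    # stand-in for experience
y = 5.7 + 0.05 * x - 0.001 * x**2 + rng.normal(0, 0.5, n)   # hypothetical curved truth

# Columns 1, x, (x-1)_+, ..., (x-63)_+  (65 columns, matching model (111))
X = np.column_stack([np.ones(n), x] + [np.maximum(x - s, 0) for s in range(1, 64)])

C, tau, sigma = 1e6, 0.01, 0.5
Q0_inv = np.diag([1 / C] + [1 / tau**2] * 64)               # inverse prior covariance

# Posterior mean: (Q0^{-1} + X'X/sigma^2)^{-1} X'y/sigma^2
beta_tilde = np.linalg.solve(Q0_inv + X.T @ X / sigma**2, X.T @ y / sigma**2)
fitted = X @ beta_tilde      # the smooth fit f_tilde^tau at the observed x values
```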

This is a pretty straightforward methodology for fitting a smooth function of experience


to the log(earnings) data. However, the key issue is the choice of the tuning parameter τ .
Here is where probability gives a very nice solution. This is discussed next.

22.4 Choosing the tuning parameter τ

The choice of τ is quite crucial to this analysis. If τ is large, then the estimator β̃ τ will be
very similar to the least squares estimator β̂ ls so that the fitted function f˜τ will be quite

wiggly. On the other hand, if τ is extremely small, then the fitted function f˜τ will be
basically constant, which would not be useful. So our ideal choice for τ should be neither
too large nor too small. How do we make this choice?

Here is how probability theory solves this problem. We shall discuss choices for τ as well
as for σ which is also an unknown parameter that needs to be chosen in order to calculate
the estimate β̃ τ and f˜τ . The idea is simply to treat τ and σ as unknown parameters and
put priors on them. We shall use the prior:
log τ ∼ Unif(−C, C) and log σ ∼ Unif(−C, C), independently

Note that this prior implies that we are allowing essentially (because C is large) all possible
values of τ and σ. In particular, we are not a priori ruling out large τ because we don’t like
wiggly solutions. We then compute the posterior of τ and σ as:

fτ,σ|data (τ, σ) ∝ fτ,σ (τ, σ)fdata|τ,σ (data). (113)

We need to calculate the likelihood term above which is the conditional distribution of the
data given τ and σ alone. As the model specifies the distribution of the data Y1, . . . , Yn
in terms of β, we need to integrate out β to obtain the likelihood in terms of τ and σ.
Fortunately, this integral can be obtained in closed form because of the following result:

β ∼ Np(m0, Q0) and Y | β ∼ Nn(Xβ, σ²In) =⇒ Y ∼ Nn( Xm0, XQ0X^T + σ²In ).




Therefore (note that for us m0 is the zero vector)


fdata|τ,σ(data) = ( 1/√(2π) )^n (det Σ)^(−1/2) exp( −(1/2) Y^T Σ^(−1) Y ) with Σ := XQ0X^T + σ²In.

Recall that Q0 is the diagonal matrix with diagonal entries C, τ 2 , . . . , τ 2 . Throughout C is


a large constant (in the analysis of the Earnings data, I took C = 10^6). We can use this
likelihood in (113) to calculate the posterior of τ and σ. We can get a grid-based discrete
approximation of this posterior. Generally, this posterior will be peaked around the maximum
likelihood values τ̂ and σ̂ so one can obtain a simpler procedure by just taking τ and σ to
be the maximum likelihood values.
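Here is a minimal Python sketch of evaluating this integrated likelihood on a grid; it reuses the arrays X and y from the previous sketch, and the grid ranges (and the reduced grid size, chosen to keep the computation quick) are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marglik(tau, sigma, X, y, C=1e6):
    # Y ~ N(0, X Q0 X' + sigma^2 I) after integrating out beta (here m0 = 0)
    Q0 = np.diag([C] + [tau**2] * (X.shape[1] - 1))
    Sigma = X @ Q0 @ X.T + sigma**2 * np.eye(len(y))
    return multivariate_normal.logpdf(y, mean=np.zeros(len(y)), cov=Sigma)

taus = np.exp(np.linspace(np.log(1e-4), np.log(1.0), 20))
sigmas = np.exp(np.linspace(np.log(0.1), np.log(2.0), 20))
ll = np.array([[log_marglik(t, s, X, y) for s in sigmas] for t in taus])
i, j = np.unravel_index(ll.argmax(), ll.shape)
print("tau_hat =", taus[i], "sigma_hat =", sigmas[j])
```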

In the Earnings dataset, this procedure can be applied to the dataset of size 500 (randomly
sampled from the original dataset). One can also apply this to the full dataset but the matrix
inversions appearing above can be somewhat slow (some linear algebra tricks can be used to
make this implementable for larger n). This analysis leads to a fairly small value of τ̂ that
leads to a smooth function fit f˜τ̂ . There is something quite interesting here. There is a big
difference between the two likelihoods:

fdata|β,σ (data) and fdata|τ,σ (data).

Indeed, maximizing fdata|β,σ (data) leads to the least squares estimate β̂ ls which would be
quite wiggly. On the other hand, maximizing fdata|τ,σ (data) leads to a fairly small estimate
of τ̂ leading to a smooth function fit f˜τ̂ . The reason for this discrepancy can be understood
by noting that

fdata|τ,σ(data) = ∫ fdata|β,σ(data) fβ|τ(β) dβ.

When τ is large, the term fβ|τ (β) will be small simply because the normal density with
variance τ 2 will be flat for large τ . On the other hand, when τ is too small, the weight
fβ|τ (β) will be significant only for very smooth βs but these βs will have poor values for
fdata|β,σ (data).

Let me stress once more that this method for choosing τ does not a priori prefer small
values of τ . The marginal or integrated likelihood automatically selects a value of τ that is
small because it gives the best likelihood for the observed data.

It should be emphasized that the integrated likelihood fdata|τ,σ (data) really does not exist
in frequentist statistics so this method of tuning parameter selection is largely Bayesian.

22.5 Additional Comments and References

The method for regression with the prior N(0, τ²) is very similar to Ridge Regression
(https://en.wikipedia.org/wiki/Ridge_regression). Usually, the tuning parameter in Ridge
regression is selected via Cross-Validation which is a method that is quite different from
the above approach using the integrated or marginal likelihood. For a somewhat nuanced
discussion on the benefits of Bayesian tuning parameter selection and Cross Validation, see
http://www.inference.org.uk/mackay/Bayes_FAQ.html#cv.

If you want to read more into this approach for high-dimensional models, I strongly rec-
ommend the 1992 paper titled Bayesian Interpolation by David MacKay (MacKay gives a
short summary of this paper in this blog post: https://statmodeling.stat.columbia.edu/2011/12/04/david-mackay-and-occams-razor/).

23 Lecture Twenty Three

23.1 Comments on the Coefficient Interpretation in Last Class’s Regression


Model

In the last class, we studied a regression model for y = log(Earnings) in terms of x =


Years of Experience. The interpretation I described for the coefficient parameters in that
model was incorrect. The correct interpretation is given below. The simplest model for y in
terms of x is the linear model which corresponds to the equation (without the error term):

y = β0 + β1 x. (114)

The interpretations of β0 and β1 in this model are very clear. β0 is simply the value of
y = log(Earnings) when x = 0 (i.e., for someone just joining the workforce). To obtain the
interpretation for β1 , first plug in x = 1 in (114) and then x = 0 and subtract the second
equation from the first. This gives
 
β1 = y1 − y0 = log(E1) − log(E0) = log(E1/E0) ≈ (E1 − E0)/E0.
Here E0 and E1 are Earnings for x = 0 and x = 1 respectively, and in the last equality, we
used log(u) ≈ u − 1 for u ≈ 1. Thus 100β1 represents the increment (in percent) in salary
from year 0 to year 1. For example, β1 = 0.05 means that the salary increases by 5% from
year 0 to year 1.

Now let us consider the model

y = β0 + β1 x + β2 (x − 1)+ (115)

Here the interpretation of β0 and β1 are exactly the same as for the model (114). β0 again
represents log(Earnings) for x = 0 and β1 represents the increment (in percent) from year 0

to year 1. What is the interpretation for β2 ? It is easy to see that:

log E2 = β0 + 2β1 + β2,   log E1 = β0 + β1,   log E0 = β0.

Thus
β2 = log E2 − 2 log E1 + log E0 = log(E2/E1) − log(E1/E0) ≈ (E2 − E1)/E1 − (E1 − E0)/E0.
Thus 100β2 represents the change in the percent increment between years 1 and 2 compared
to the percent increment between years 0 and 1. For example, β2 = −0.0003 means that the
percent increment decreases by 0.03 after year 2. If β1 = 0.05, we would have a 5% increment
after year 1 and a 5 − 0.03 = 4.97% increment after year 2.

Now consider the model that we actually used last time:

y = β0 + β1 x + β2 (x − 1)+ + β3 (x − 2)+ + · · · + β64 (x − 63)+ . (116)

Here the interpretation for β0 , β1 , β2 are just the same as in Model (115). More generally,
the interpretation for βj , j ≥ 2 is as follows: 100βj is the change in the percent increment
between years j − 1 and j compared to the percent increment between years j − 2 and j − 1.
For a concrete example, suppose

β0 = 5.74 β1 = 0.05 β2 = −0.0003 β3 = −0.0008 β4 = −0.001 ...

then

1. weekly earnings for someone just joining the workforce is exp(5.74) = $311.06,

2. increment after year 1 is 5%,

3. increment after year 2 is (5 − 0.03) = 4.97%,

4. increment after year 3 is (4.97 − 0.08) = 4.89%,

5. increment after year 4 is (4.89 − 0.1) = 4.79%, and so on.

If all βj , j ≥ 2 are negative, then, after a while, the increments may become negative which
means that the salary actually starts decreasing after a certain number of years of experience.

It should be clear from the above that β0 , β1 and βj , j ≥ 2 are different kinds of parameters
(they have different units for instance). In particular, we would expect βj , j ≥ 2 to be quite
small. In the last class, we analyzed model (116) with the prior
β0 ∼ N(0, C) and β1, . . . , β64 ∼ i.i.d. N(0, τ²)

for a parameter τ > 0 (and large C). This analysis seems to treat β1 as well as βj , j ≥ 2 in
the same way. A better prior would be:
β0 ∼ N(0, C), β1 ∼ N(0, C) and β2, . . . , β64 ∼ i.i.d. N(0, τ²)    (117)

23.2 Comments on Regularization

Parameter estimation for the model (116) using the prior (117) is an example of regulariza-
tion. Regularization is one of the most important ideas in statistics and machine learning in
the past 30 years. The reason for regularization is that, in models with lots of parameters
such as (116), standard (unregularized) estimation procedures give nonsensical answers. For

example, for the model (116) applied to 500 randomly selected observations from the full
ex1029 dataset from the R package Sleuth3, the usual least squares estimates are given by

β0 = 5.55 β1 = 0.045 β2 = 0.312 β3 = −0.431 β4 = 0.299 ...

The interpretation would then become

1. weekly earnings for someone just joining the workforce is exp(5.55) = $257.24,

2. increment after year 1 is 4.5%,

3. increment after year 2 is (4.5 + 31.2) = 35.7%,

4. increment after year 3 is (35.7 − 43.1) = −7.4%,

5. increment after year 4 is (−7.4 + 29.9) = 22.5%, and so on.

These increments fluctuate wildly so as to make these numbers nonsensical. This is the
reason why one regularizes while dealing with many parameters. In the context of the model
(116), regularization is done in the following ways:

1. The common approach: The common approach estimates β0 , . . . , βm (here m =


64) by minimizing the least squares criterion plus a penalty term which encourages
β2 , . . . , βm to be small. One way of doing this is via the minimization of
∑_{i=1}^n ( yi − β0 − β1 xi − β2 (xi − 1)+ − β3 (xi − 2)+ − · · · − βm (xi − (m − 1))+ )² + λ( β2² + · · · + βm² )


for a suitable tuning parameter λ. When λ is large, the minimizer of the above criterion
will have βj , 2 ≤ j ≤ m small. But if λ is too large, then βj , 2 ≤ j ≤ m will be very
close to zero. On the other hand, if λ is too small, then βj , 2 ≤ j ≤ m will be close
to the least squares estimates. The choice of λ is obviously quite crucial. For this,
one uses the idea that “Regularized estimates often lead to better predictions”. To use
this idea in practice for selecting λ, the original dataset is divided into two subsets:
training and test datasets. One would fit a regularized model to the training dataset
by minimizing the above criterion for each value of λ. The value of λ which minimizes
average prediction error on the test dataset would then be selected. This is the basic
idea underlying methods such as cross-validation. This approach is quite common
although there are some issues such as: (a) there are no principled approaches for
doing the training-test splits, and often different splits lead to different answers, and
(b) the value of λ leading to best predictions on the test dataset might lead to estimates
of βj ’s that still fluctuate quite a bit (although not as much as the unpenalized least
squares estimates).

2. The Probability/Bayesian approach: This is the approach that we discussed in


the last lecture. The starting point is the observation that the usual least squares
estimate can be seen as Bayesian estimates corresponding to the prior:
β0, β1, β2, . . . , βm ∼ i.i.d. N(0, C)    (118)

for a large C. To achieve regularization, one changes the above model to


β0, β1 ∼ i.i.d. N(0, C) and β2, . . . , βm ∼ i.i.d. N(0, τ²)    (119)

where τ > 0 is an unknown parameter. The modeling assumption (119) is more flexible
than (118). Indeed, (119) includes (118) as a special case when τ = √C. Using the

prior assumption (119), one can calculate the (marginal) likelihood of σ and τ by
integrating the conditional density of the data given β with respect to the prior (119):
fdata|τ,σ(data) = ∫ fdata|β,σ(data) fβ|τ(β) dβ.

The best value of τ would then be obtained by maximizing this integrated likelihood
over τ and σ.

These two methods for performing regularization are actually quite different. There is no
training-test split in the Bayesian approach. On the other hand, there is no such thing as
integrated or marginal likelihood in the common approach. For a comparative discussion on
the benefits of the Bayesian approach, see http://www.inference.org.uk/mackay/Bayes_FAQ.html#cv.

If you want to read more into the Bayesian approach for high-dimensional models, I
strongly recommend the 1992 paper titled Bayesian Interpolation by David MacKay (MacKay
gives a short summary of this paper in this blog post: https://statmodeling.stat.columbia.edu/2011/12/04/david-mackay-and-occams-razor/).

24 Lecture Twenty Four

24.1 Central Limit Theorem (CLT)

The following is the simplest version of the CLT.

Theorem 24.1 (Central Limit Theorem). Suppose Xi , i = 1, 2, . . . are i.i.d with E(Xi ) = µ
and var(Xi ) = σ 2 < ∞. Then, with X̄n = (X1 + · · · + Xn )/n,

√n(X̄n − µ)/σ

converges in distribution to N(0, 1). Convergence in distribution here means that

P{ a ≤ √n(X̄n − µ)/σ ≤ b } → ∫_a^b (1/√(2π)) e^(−x²/2) dx for all −∞ ≤ a < b ≤ +∞.

Informally, the CLT says that for i.i.d observations X1 , . . . , Xn with finite mean µ and

variance σ², the quantity √n(X̄n − µ)/σ is approximately (or asymptotically) N(0, 1). In-
formally, the CLT also implies that

1. √n(X̄n − µ) is approximately N(0, σ²).

2. X̄n is approximately N (µ, σ 2 /n).

3. Sn = X1 + · · · + Xn is approximately N (nµ, nσ 2 ).

4. Sn − nµ is approximately N (0, nσ 2 ).

5. (Sn − nµ)/(√n σ) is approximately N(0, 1).

It may be helpful here to note that

E(X̄n ) = µ and var(X¯n ) = σ 2 /n

and also
E(Sn ) = nµ and var(Sn ) = nσ 2 .
The most remarkable feature of the CLT is that it holds regardless of the distribution of Xi
(as long as they are i.i.d from a distribution F that has a finite mean and variance). Therefore
the CLT is, in this sense, distribution-free. To illustrate the fact that the distribution of Xi
can be arbitrary, let us consider the following examples.

1. Bernoulli: Suppose Xi are i.i.d Bernoulli random variables with probability of success
given by p. Then EXi = p and var(Xi ) = p(1 − p) so that the CLT implies that
√n(X̄n − p)/√(p(1 − p)) is approximately N(0, 1). This is actually called De Moivre's
theorem which was proved in 1733 before the general CLT. The general CLT stated
above was proved by Laplace in 1810.

The CLT also implies here that Sn is approximately N (np, np(1 − p)). We know that
Sn is exactly distributed according to the Bin(n, p) distribution. We therefore have the
following result: When p is fixed and n is large, the Binomial distribution Bin(n, p) is
approximately same as the normal distribution with mean np and variance np(1 − p).

2. Poisson: Suppose Xi are i.i.d P oi(λ) random variables. Then EXi = λ = var(Xi )
so that the CLT says that Sn = X1 + · · · + Xn is approximately Normal with mean
nλ and variance nλ. It is not hard to show here that Sn is exactly distributed as a
P oi(nλ) random variable (proved later). We deduce therefore that when n is large and
λ is held fixed, P oi(nλ) is approximately same as the Normal distribution with mean
nλ and variance nλ.

3. Gamma: Suppose Xi are i.i.d random variables having the Gamma(α, λ) distribution.
Check then that EXi = α/λ and var(Xi ) = α/λ2 . We deduce then, from the CLT, that
Sn = X1 +· · ·+Xn is approximately normally distributed with mean nα/λ and variance
nα/λ2 . We derived previously that Sn is exactly distributed as Gamma(nα, λ). Thus
when n is large and α and λ are held fixed, the Gamma(nα, λ) distribution is closely
approximated by the N(nα/λ, nα/λ²) distribution according to the CLT.

4. Chi-squared. Suppose Xi are i.i.d chi-squared random variables with 1 degree of


freedom i.e., Xi = Zi2 for i.i.d standard normal random variables Z1 , Z2 , . . . . It is
easy to check then that Xi is a Gamma(1/2, 1/2) random variable. This gives that
X1 + · · · + Xn is exactly Gamma(n/2, 1/2). This exact distribution of X1 + · · · + Xn is
also called the chi-squared distribution with n degrees of freedom (denoted by χ2n ). The
CLT therefore implies that the χ2n distribution is closely approximated by N (n, 2n).

5. Cauchy. Suppose Xi are i.i.d standard Cauchy random variables. Then Xi ’s do not
have finite mean and variance. Thus the CLT does not apply here. In fact, it can be
proved here that (X1 + · · · + Xn )/n has the Cauchy distribution for every n. A sketch
of this proof is given later in this lecture.
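Here is a minimal Python sketch checking the CLT by simulation, using the Poisson example above (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 1000, 100_000
# S_n = X_1 + ... + X_n for X_i i.i.d. Poi(lam) is exactly Poi(n*lam)
S = rng.poisson(lam * n, size=reps)
Z = (S - n * lam) / np.sqrt(n * lam)         # standardize: (S_n - n*mu)/(sqrt(n)*sigma)
print("P(Z <= 1.96) ~", np.mean(Z <= 1.96))  # CLT predicts about 0.975
```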

24.2 CLT Proof strategy

To prove the CLT, the natural idea is to write down some sort of formula for
P{ a ≤ √n(X̄n − µ)/σ ≤ b }

and then see what happens to the formula as n → ∞. The problem with this approach
is that the formula is a bit tricky to write (it depends on whether the random variables

are discrete or continuous, for example). Even if we assume that the random variables are
discrete, the formula is a bit messy. For example, suppose that X1 , X2 , . . . are i.i.d discrete
random variables taking the values 0, 1, 2, . . . . Then
P{ a ≤ √n(X̄n − µ)/σ ≤ b }
= P{ nµ + aσ√n ≤ X1 + · · · + Xn ≤ nµ + bσ√n }
= ∑_{x1,...,xn : xj ∈ {0,1,2,...}} px1 px2 . . . pxn I{ nµ + aσ√n ≤ x1 + · · · + xn ≤ nµ + bσ√n }

where pj := P{X1 = j}. Understanding the behaviour of this as n gets large is a bit tricky.

For this reason, while dealing with sums of independent random variables, people usually
work with certain transforms of distributions.

24.3 Transforms

There are three commonly used transforms: z-transform (also known as Probability Gener-
ating Function) for random variables taking values in {0, 1, 2, . . . }, Laplace transform (also
known as Moment Generating Function), and the Fourier transform (also known as the
Characteristic Function).

24.3.1 z-Transform (Probability Generating Function)

Suppose X is a discrete random variable taking the values 0, 1, 2, . . . . The z-transform of X


is defined as

GX(z) := E(z^X) = ∑_{j=0}^∞ z^j P{X = j}.

Here is a simple example.

Example 24.2. Suppose X takes the values 8, 13, 20, 29, 35 with probabilities 0.3, 0.2, 0.15, 0.3, 0.05.
Then the z-transform of X is given by

GX (z) = 0.3z 8 + 0.2z 13 + 0.15z 20 + 0.3z 29 + 0.05z 35 .

In general GX (z) will be a complicated power series with many terms. In some cases
however, the series corresponding to GX (z) can be summed explicitly to yield a simpler
formula for GX (z) as in the next example.

Example 24.3. Suppose X has the Poisson distribution with mean λ. Then
GX(z) = ∑_{j=0}^∞ z^j P{X = j} = ∑_{j=0}^∞ z^j e^(−λ) λ^j/j! = e^(−λ) ∑_{j=0}^∞ (λz)^j/j! = e^(−λ) e^(λz) = exp( λ(z − 1) ).

Note that GX(z) uniquely determines the distribution of X because

P{X = k} = (1/k!) (d^k/dz^k) GX(z) |_{z=0}

for every k = 0, 1, 2, . . . .

The z-transform (and all other transforms) have the important property that the transform
for the sum of n i.i.d random variables is simply the transform of the individual random
variable raised to power n. This is because:

GX1+···+Xn(z) = E( z^(X1+···+Xn) ) = E( z^X1 z^X2 . . . z^Xn ) = E(z^X1) E(z^X2) . . . E(z^Xn) = (GX1(z))^n.

The utility of this result for dealing with sums of i.i.d random variables can be found in the
following two examples.

Example 24.4. Suppose X1, . . . , X100 are i.i.d with common distribution giving probabilities
0.3, 0.2, 0.15, 0.3, 0.05 to the values 8, 13, 20, 29, 35. What is the distribution
of X1 + · · · + X100?

X1 + · · · + X100 will be a discrete random variable taking many possible values. Specifying
its distribution via the probability mass function will be tedious. But it is very easy to write
down its z-transform as
GX1+···+X100(z) = (GX1(z))^100 = ( 0.3z^8 + 0.2z^13 + 0.15z^20 + 0.3z^29 + 0.05z^35 )^100.
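Here is a minimal Python sketch making this computation explicit: since multiplying z-transforms corresponds to convolving coefficient sequences, the pmf of X1 + · · · + X100 can be read off from the coefficients of (GX1(z))^100 (np.convolve multiplies polynomial coefficient sequences):

```python
import numpy as np

# Coefficients of G_X(z): index j holds P{X = j}
pmf = np.zeros(36)
pmf[[8, 13, 20, 29, 35]] = [0.3, 0.2, 0.15, 0.3, 0.05]

pmf_sum = np.array([1.0])                  # z-transform of the constant 0
for _ in range(100):
    pmf_sum = np.convolve(pmf_sum, pmf)    # multiply by G_X(z) once more
print("P{X_1 + ... + X_100 = 2000} =", pmf_sum[2000])
```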

Example 24.5. The following is a fundamental fact. The sum of n independent random
variables X1, . . . , Xn with Xi ∼ Poi(λi) for i = 1, . . . , n has the Poi(λ1 + · · · + λn) distribution. This can
be very easily proved using z-transforms (and the fact that the z-transform of P oi(λ) equals
exp(λ(z − 1))) because

GX1+···+Xn(z) = GX1(z) GX2(z) . . . GXn(z) = exp( λ1(z − 1) ) exp( λ2(z − 1) ) . . . exp( λn(z − 1) ) = exp( (λ1 + · · · + λn)(z − 1) ).

Thus the z-transform of X1 + · · · + Xn coincides with that of P oi(λ1 + · · · + λn ) which implies


that X1 + · · · + Xn has the P oi(λ1 + · · · + λn ) distribution.

While the z-transform makes dealing with independent sums convenient, it is not a general
tool as it is only defined for discrete random variables taking the values 0, 1, 2, . . . . This is
the reason for considering Laplace and Fourier transforms.

24.3.2 Laplace Transform (Moment Generating Function)

The Laplace transform of a random variable X is defined as the function:

MX(t) := E( e^(tX) )

for all t ∈ (−∞, ∞) for which E(e^(tX)) < ∞. Note that MX(0) = 1.

Example 24.6 (MGF of Standard Gaussian). If X ∼ N (0, 1), then its MGF can be easily
computed as follows:
E(e^(tX)) = (1/√(2π)) ∫_{−∞}^∞ e^(tx) e^(−x²/2) dx = (1/√(2π)) ∫_{−∞}^∞ exp( −(x − t)²/2 ) exp(t²/2) dx = e^(t²/2).

Thus MX(t) = e^(t²/2) for all t ∈ R.

The Laplace transform is defined for every random variable X (although for some random
variables MX (t) can be +∞ for many values of t) unlike the z-transform which is only
defined for random variables taking values in {0, 1, 2, . . . }. Just like the z-transform, the
Laplace transform also factorizes for independent random variables. Indeed, if X1 , . . . , Xn
are independent, then
MX1 +···+Xn (t) = MX1 (t)MX2 (t) . . . MXn (t).
This is a consequence of the fact that
E e^(t(X1+···+Xn)) = E( ∏_{i=1}^n e^(tXi) ) = ∏_{i=1}^n E e^(tXi),

the last equality being a consequence of independence.

The Laplace transform is known as the Moment Generating Function because it allows
one to easily read off the moments of the random variable. For k ≥ 1, the number E(X k ) is
called the k th moment of X. Knowledge of MX (t) allows one to easily read off the moments
of X because the power series expansion of MX (t) is
MX(t) = E e^(tX) = ∑_{k=0}^∞ (t^k/k!) E(X^k).

Therefore the kth moment of X is simply the coefficient of t^k in the power series expansion
of MX(t) multiplied by k!. Alternatively, one can derive the moments E(X^k) as derivatives
of the MGF at 0 because

MX^(k)(t) = (d^k/dt^k) E(e^(tX)) = E( (d^k/dt^k) e^(tX) ) = E( X^k e^(tX) )

so that

MX^(k)(0) = E(X^k).
In words, E(X k ) equals the k th derivative of MX at 0. Therefore
MX′(0) = E(X) and MX″(0) = E(X²)
and so on.

As an application, we can deduce the moments of the standard normal distribution from
the fact that its Laplace Transform equals e^(t²/2). Indeed, because

e^(t²/2) = ∑_{i=0}^∞ t^(2i)/(2^i i!),

it immediately follows that the kth moment of N(0, 1) equals 0 when k is odd and equals
(2j)!/(2^j j!) when k = 2j.
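Here is a minimal Python sketch checking this moment formula by Monte Carlo:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
x = rng.normal(size=2_000_000)
for j in range(1, 4):
    k = 2 * j
    exact = factorial(k) // (2**j * factorial(j))   # (2j)!/(2^j j!)
    print(k, round(np.mean(x**k), 3), exact)        # k=2 -> 1, k=4 -> 3, k=6 -> 15
```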
The Central Limit Theorem can be established using Laplace transforms in the following
way.

Proof of the CLT with Laplace Transforms. We have i.i.d random variables X1, X2, . . . which
have mean µ and finite variance σ². Let Yn := √n(X̄n − µ)/σ. We need to show that Yn converges in distribution to N(0, 1). We shall show that the Laplace transform of Yn converges
to the Laplace transform of N(0, 1), which is e^(t²/2):

MYn(t) → e^(t²/2) for every t.

Note that
Yn = √n (X̄n − µ)/σ = (1/√n) ∑_{i=1}^n (Xi − µ)/σ.
As a result,
MYn(t) = M_{∑i(Xi−µ)/(√n σ)}(t) = M_{∑i(Xi−µ)/σ}( t n^(−1/2) ) = ∏_{i=1}^n M_{(Xi−µ)/σ}( t n^(−1/2) ) = ( M(t n^(−1/2)) )^n

where M (·) is the Laplace transform of (X1 − µ)/σ. We now use Taylor’s theorem to expand
M (tn−1/2 ) up to a quadratic polynomial around 0. Recall that Taylor’s theorem says that
for a function f and two points x and p in the domain of f , we can write
f(x) = f(p) + f′(p)(x − p) + (f″(p)/2!)(x − p)² + · · · + (f^(r)(p)/r!)(x − p)^r + (f^(r+1)(ξ)/(r + 1)!)(x − p)^(r+1)
where ξ is some point that lies between x and p. Using Taylor’s theorem with r = 1,
x = tn−1/2 and p = 0, we obtain
M( t n^(−1/2) ) = M(0) + (t/√n) M′(0) + (t²/(2n)) M″(sn)
for some sn that lies between 0 and tn−1/2 . This implies therefore that sn → 0 as n → ∞.
Note now that M (0) = 1 and M ′ (0) = E((X1 − µ)/σ) = 0. We therefore deduce that
MYn(t) = ( 1 + (t²/(2n)) M″(sn) )^n.
Note also that
M″(sn) → M″(0) = E( ((X1 − µ)/σ)² ) = 1 as n → ∞.
We therefore invoke the following fact:
lim_{n→∞} ( 1 + an/n )^n = e^a provided lim_{n→∞} an = a    (120)

to deduce that

MYn(t) = ( 1 + (t²/(2n)) M″(sn) )^n → e^(t²/2) = MN(0,1)(t).
This completes the proof of the CLT assuming the fact (120). It remains to prove (120).
There exist many proofs for this. Here is one. Write
( 1 + an/n )^n = exp( n log(1 + an/n) ).
Let ℓ(x) := log(1 + x). Taylor's theorem for ℓ with r = 1 and p = 0 gives

ℓ(x) = ℓ(0) + ℓ′(0)x + ℓ″(ξ) x²/2 = x − x²/( 2(1 + ξ)² )
for some ξ that lies between 0 and x. Taking x = an /n, we get
ℓ(an/n) = log(1 + an/n) = an/n − an²/( 2n²(1 + ξn)² )
for some ξn that lies between 0 and an /n (and hence ξn → 0 as n → ∞). As a result,
( 1 + an/n )^n = exp( n log(1 + an/n) ) = exp( an − an²/(2n(1 + ξn)²) ) → e^a
as n → ∞. This proves (120).

The above proof of the CLT has two deficiencies:

1. It tacitly assumes that the moment generating function of X1 , . . . , Xn exists for all
t. This is much stronger than the existence of the variance of Xi (which is all the
CLT needs). Indeed if MX (t) exists for all t in any open interval containing zero, then
moments of all orders (not just the variance) exist.

2. We have proved that the Laplace transform of √n(X̄n − µ)/σ converges to that of N(0, 1).
It is not clear though how this implies that

P{ a ≤ √n(X̄n − µ)/σ ≤ b } → ∫_a^b (1/√(2π)) e^(−x²/2) dx

for −∞ ≤ a < b ≤ ∞.

To fix these two deficiencies, one works with the Fourier transform for proving the CLT.

24.3.3 Fourier Transform (Characteristic Function)

The Fourier transform of a random variable X is defined as the function:

ϕX(t) := E( e^(itX) ) = E cos(tX) + i E sin(tX)

for all t ∈ (−∞, ∞). Here i = √−1. The Fourier transform is defined for every random
variable and it is finite for all t ∈ (−∞, ∞). This is because cos(tX) and sin(tX) are always
bounded by 1 so the expectation will obviously be finite. For example, suppose X has the
Cauchy distribution with density:
fX(x) := 1/( π(1 + x²) ).

Then it is easy to check that the Laplace transform MX (t) will equal +∞ for all t ̸= 0. On
the other hand, the Fourier transform of X is
ϕX(t) = E e^(itX) = ∫_{−∞}^∞ e^(itx)/( π(1 + x²) ) dx.

It turns out that the above integral equals e−|t| so that

ϕX (t) = e−|t| .

You should find a proof of the above online. Just like the other two transforms, the Fourier
transform factorizes for independent random variables. Indeed, if X1 , . . . , Xn are indepen-
dent, then
ϕX1 +···+Xn (t) = ϕX1 (t)ϕX2 (t) . . . ϕXn (t).
This is a consequence of the fact that
 
E e^(it(X1+···+Xn)) = E( ∏_{j=1}^n e^(itXj) ) = ∏_{j=1}^n E e^(itXj),

the last equality being a consequence of independence.

Example 24.7. The following is a standard fact. Suppose X1, X2, . . . , Xn are i.i.d random variables having the Cauchy distribution. Then their mean X̄n := (X1 + · · · + Xn)/n also has the
Cauchy distribution. This can be proved using Fourier transforms as follows. The Fourier
transform of X̄n equals

ϕX̄n(t) = E e^(itX̄n) = E exp( i(t/n)(X1 + · · · + Xn) ) = ϕX1+···+Xn(t/n) = ( ϕX1(t/n) )^n.

Because the Fourier transform of the Cauchy distribution equals e^(−|t|), we get

ϕX̄n(t) = ( exp(−|t|/n) )^n = exp(−|t|).

Thus the Fourier transform of X̄n also equals e^(−|t|), which implies that X̄n also has the Cauchy
distribution.
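Here is a minimal Python sketch illustrating this fact by simulation: the spread of X̄n does not shrink as n grows (the interquartile range of the standard Cauchy is 2, with quartiles at ±1):

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 10_000]:
    means = rng.standard_cauchy((1000, n)).mean(axis=1)   # 1000 draws of X_bar_n
    iqr = np.percentile(means, 75) - np.percentile(means, 25)
    print(n, round(iqr, 2))    # stays near 2 for every n, unlike under the CLT
```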

In the next class, we shall study the proof of the CLT using the Fourier transform.

25 Lecture Twenty Five

25.1 Recap: Last Class

In the last lecture, we looked at the Central Limit Theorem:

Theorem 25.1 (Central Limit Theorem). Suppose Xi , i = 1, 2, . . . are i.i.d with E(Xi ) = µ
and var(Xi ) = σ 2 < ∞. Then, with X̄n = (X1 + · · · + Xn )/n,
√n(X̄n − µ)/σ

converges in distribution to N(0, 1). Convergence in distribution here means that

P{ a ≤ √n(X̄n − µ)/σ ≤ b } → ∫_a^b (1/√(2π)) e^(−x²/2) dx for all −∞ ≤ a < b ≤ +∞.

We also discussed the usefulness of transforms for proving the CLT, and proved it using
the Laplace Transform (MGF) by showing that
MYn(t) → MN(0,1)(t) = e^(t²/2) where Yn := √n(X̄n − µ)/σ.
This proof suffers from the following two deficiencies:

1. It tacitly assumes that the moment generating function of X1 , . . . , Xn exists for all
t. This is much stronger than the existence of the variance of Xi (which is all the
CLT needs). Indeed if MX (t) exists for all t in any open interval containing zero, then
moments of all orders (not just the variance) exist.

2. We have proved that the Laplace transform of √n(X̄n − µ)/σ converges to that of N(0, 1).
It is not clear though how this implies that

P{ a ≤ √n(X̄n − µ)/σ ≤ b } → ∫_a^b (1/√(2π)) e^(−x²/2) dx

for −∞ ≤ a < b ≤ ∞.

To fix these two deficiencies, the Fourier transform (or Characteristic Function) is used to
prove the CLT.

25.2 CLT proof via the Fourier Transform

Recall, from the last lecture, that the Fourier transform of a random variable X is defined
as the function:
φX(t) := E( e^(itX) ) = E cos(tX) + i E sin(tX)

for all t ∈ (−∞, ∞). Here i = √−1. The Fourier transform is defined for every random
variable and it is finite for all t ∈ (−∞, ∞). This is because cos(tX) and sin(tX) are
always bounded by 1 so the expectation will obviously be finite. Just like the Laplace trans-
form, the Fourier transform also factorizes for independent random variables. Specifically, if
X1 , . . . , Xn are independent, then
φX1 +···+Xn (t) = φX1 (t)φX2 (t) . . . φXn (t).
As a consequence, if X1 , . . . , Xn are i.i.d,
φX1 +···+Xn (t) = (φX1 (t))n .
A sketch of the proof of the CLT via the Fourier transform is provided below. I will skip
some technical details and provide only the high level ideas. Full details can be found, for
example, in Chapter 6 of the book A Course in Probability Theory by Kai Lai Chung.

Fourier Proof of CLT. We have i.i.d random variables X1 , X2 , . . . which have mean µ and

finite variance σ². Let Yn := √n(X̄n − µ)/σ. We shall first prove that the Fourier transform
of Yn converges to that of N (0, 1):
φYn (t) → φN (0,1) (t) for every t.
The Fourier transform of N (0, 1) equals
φN(0,1)(t) = ∫_{−∞}^∞ e^(itx) (1/√(2π)) e^(−x²/2) dx = e^(−t²/2) ∫_{−∞}^∞ (1/√(2π)) exp( −(1/2)(x − it)² ) dx = e^(−t²/2).
We will show that
φYn(t) → e^(−t²/2) for every t.
Because
Yn = √n (X̄n − µ)/σ = (1/√n) ∑_{i=1}^n (Xi − µ)/σ,

we can write
φYn(t) = φ_{∑i(Xi−µ)/(√n σ)}(t) = φ_{∑i(Xi−µ)/σ}( t n^(−1/2) ) = ∏_{i=1}^n φ_{(Xi−µ)/σ}( t n^(−1/2) ) = ( φ(t n^(−1/2)) )^n

where φ(·) is the Fourier transform of (X1 − µ)/σ. We now use Taylor’s theorem to expand
φ(tn−1/2 ) up to a quadratic polynomial around 0:
φ( t n^(−1/2) ) ≈ φ(0) + (t/√n) φ′(0) + (t²/(2n)) φ″(0).
It is easy to check that, for every random variable R, we have φR (0) = Eei(0)R = 1. Also
φ′R(t) = (d/dt) E e^(itR) = E( (d/dt) e^(itR) ) = E( iR e^(itR) )

φ″R(t) = (d/dt) φ′R(t) = E( (d/dt)(iR e^(itR)) ) = E( (iR)² e^(itR) ).

Plugging in t = 0, we get

φ′R (0) = E(iR) = iER and φ′′R (0) = E(iR)2 = i2 E(R2 ) = −E(R2 ).

Using R = (X1 − µ)/σ (which has mean zero and unit variance) and φ = φR , we get

φ(0) = 1 and φ′ (0) = 0 and φ′′ (0) = −1.

Therefore
φYn(t) = ( φ(t n^(−1/2)) )^n ≈ ( φ(0) + (t/√n) φ′(0) + (t²/(2n)) φ″(0) )^n = ( 1 − t²/(2n) )^n → e^(−t²/2)

as n → ∞. This proves that the Fourier transform of Yn converges to that of N (0, 1). From
here, we now need to deduce that

P{a ≤ Yn ≤ b} → P {a ≤ N (0, 1) ≤ b} (121)

This statement is equivalent to

EI[a,b] (Yn ) → EI[a,b] (N (0, 1))

where I[a,b] (x) := I{a ≤ x ≤ b} is the indicator function of the interval [a, b]. How does this
follow from the convergence of the Fourier transforms

E e^(itYn) → E e^(itN(0,1))?

The idea is that Fourier analysis guarantees that the indicator function can be represented
as a linear combination of the complex functions eitx . This representation is of the form:
I[a,b](x) = ∫_{−∞}^∞ e^(itx) g(t) dt

for some function g. From here, one can write


E I[a,b](Yn) = E( ∫_{−∞}^∞ e^(itYn) g(t) dt )
= ∫_{−∞}^∞ E( e^(itYn) ) g(t) dt
= ∫_{−∞}^∞ φYn(t) g(t) dt
→ ∫_{−∞}^∞ φN(0,1)(t) g(t) dt
= ∫_{−∞}^∞ E( e^(itN(0,1)) ) g(t) dt = E( ∫_{−∞}^∞ e^(itN(0,1)) g(t) dt ) = E I[a,b](N(0, 1))

which proves (121). For a rigorous version of this argument, see Chapter 6 of the book A
Course in Probability Theory by Kai Lai Chung.

25.3 Closing Thoughts

In this class, we took the view that probability is a general and principled method of reasoning
under uncertainty. Here is a quote by Jaynes (from one of his papers in 1957): the purpose
of any application of probability theory is simply to help us in forming reasonable judgements

in situations where we do not have complete information. Hopefully, this course convinced
you that probability theory applies to many problems that are commonly studied in the
fields of statistics and machine learning (what is usually called “Bayesian Statistics” is just
Probability Theory). I would like to leave you with a couple of quotes by Laplace (from the
book A Philosophical Essay on Probabilities by Pierre Simon Laplace) glorifying probability
theory. Laplace was one of the founders of probability theory and Bayesian statistics:

1. It is remarkable that a science, which commenced with a consideration of the games


of chance, should be elevated to the rank of the most important subjects of human
knowledge.

2. If we consider

a) the analytical methods to which this theory has given birth;

b) the truth of the principles which serve as a basis;

c) the fine and delicate logic which their employment in the solution of problems
requires;

d) the establishments of public utility which rest upon it;

e) the extension which it has received and which it can still receive by its application
to the most important questions of natural philosophy and the moral science;

if we consider again

a) that, even in the things which cannot be submitted to calculus, it gives the surest
hints which can guide us in our judgements, and

b) that it teaches us to avoid the illusions which offtimes confuse us,

then we shall see that there is no science more worthy of our meditations, and that no
more useful one could be incorporated in the system of public instruction.

References
[1] Jaynes, E. T. (2003). Probability theory: the logic of science. Cambridge University Press.

[2] Sinai, Y. G. (1992). Probability theory: an introductory course. Springer-Verlag.

[3] DeGroot, M. H. (1986). A conversation with David Blackwell. Statistical Science, 1, 40–53.

[4] Mosteller, F. (1987). Fifty challenging problems in probability with solutions. Courier Corporation.

[5] MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge University Press.
