Conditional distributions

Conditional distributions in general are rather abstract. When the random variables in question
are discrete (µ = counting measure), however, things are quite simple; the reason is that events
where the value of the random variable is fixed have positive probability, so the ordinary
conditional probability formula involving ratios can be applied.
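To make the discrete case concrete, here is a minimal Python sketch (the particular joint pmf is just an illustrative choice, and numpy is assumed available) showing that the conditional pmf is the ordinary ratio of joint to marginal:

import numpy as np

# Illustrative joint pmf of (X, Y), with X in {0, 1} (rows) and Y in {0, 1, 2} (columns)
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.30, 0.15]])

p_x = p_xy.sum(axis=1)                 # marginal pmf of X
p_y_given_x = p_xy / p_x[:, None]      # P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)

print(p_y_given_x.sum(axis=1))         # each row is a genuine pmf, so it sums to 1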
When one or more of the random variables in question are continuous (dominated by Lebesgue
measure), then more care must be taken. Suppose random variables X and Y have a joint
distribution with density function pX,Y (x, y), with respect to some dominating (product)
measure µ×ν. Then the corresponding marginal distributions have densities with respect to µ and
ν, respectively, given by
pX(x) = ∫ pX,Y(x, y) dν(y)   and   pY(y) = ∫ pX,Y(x, y) dµ(x).
Moreover, the conditional distribution of Y , given X = x, also has a density with respect to ν,
and is given by the ratio
pY|X(y | x) = pX,Y(x, y) / pX(x).
As a function of x, for given y, this is clearly µ-measurable since the joint and marginal densities
are measurable. Also, for a given x, pY |X(y | x) defines a probability measure Qx, called the
conditional distribution of Y , given X = x, through the integral
Qx(B) = ∫_B pY|X(y | x) dν(y).
That is, pY|X(y | x) is the Radon–Nikodym derivative of the conditional distribution Qx. For
our purposes, the conditional distribution can always be defined through its conditional density,
though, in general, a conditional density may not exist even if the conditional distribution Qx
does. There are real cases where the most general definition of conditional distribution
(Keener 2010, Sec. 6.2) is required, e.g., in the proof of the Neyman–Fisher factorization
theorem and in the proof of the general Bayes theorem. I should also mention that conditional
distributions are not unique: the conditional density can be redefined arbitrarily on a set of
ν-measure zero without affecting the integral that defines Qx(B) above. We will not dwell on
this point here, but students should be aware of the subtleties of conditional distributions; the
Wikipedia page on the Borel paradox gives a clear explanation of these difficulties, along with
references, e.g., to Jaynes (2003), Chapter 15.
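As a sanity check on the density-ratio formula, here is a short Python sketch (using a hypothetical joint density p(x, y) = x + y on the unit square, with scipy assumed available) that computes the marginal by numerical integration and confirms that the resulting conditional density integrates to one:

from scipy.integrate import quad

# Hypothetical joint density on the unit square: p_{X,Y}(x, y) = x + y
p_xy = lambda x, y: x + y

def p_x(x):
    # marginal density: integrate the joint density over y
    return quad(lambda y: p_xy(x, y), 0.0, 1.0)[0]

def p_y_given_x(y, x):
    # conditional density as the ratio p_{X,Y}(x, y) / p_X(x)
    return p_xy(x, y) / p_x(x)

# For any fixed x, the conditional density integrates to 1 in y
print(quad(lambda y: p_y_given_x(y, 0.3), 0.0, 1.0)[0])   # ~1.0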
Given a conditional distribution with density pY|X(y | x), we can define conditional probabilities
and expectations. That is,
P(Y ∈ B | X = x) = ∫_B pY|X(y | x) dν(y).
Here I use the more standard notation for conditional probability. The law of total probability
then allows us to write
P(Y ∈ B) = ∫ P(Y ∈ B | X = x) pX(x) dµ(x),
in other words, marginal probabilities for Y may be obtained by taking expectation of the
conditional probabilities. More generally, for any ν-integrable function ϕ, we may write the
conditional expectation
E{ϕ(Y) | X = x} = ∫ ϕ(y) pY|X(y | x) dν(y).
We may evaluate the above expectation for any x, so we actually have defined a (µ-measurable)
function, say, g(x) = E(Y | X = x); here I took ϕ(y) = y for simplicity. Now, g(X) is a random
variable, to be denoted by E(Y | X), and we can ask about its mean, variance, etc. The
corresponding version of the law of total probability for conditional expectations is
E(Y ) = E{E(Y | X)}. (1.6)
This formula is called smoothing in Keener (2010) but I would probably call it a law of iterated
expectation. This is actually a very powerful result that can simplify lots of calculations; Keener
(2010) uses this a lot. There are versions of iterated expectation for higher moments, e.g.,
V(Y ) = V{E(Y | X)} + E{V(Y | X)}, (1.7)
C(X, Y ) = E{C(X, Y | Z)} + C{E(X | Z), E(Y | Z)}, (1.8)
where V(Y | X) is the conditional variance, i.e., the variance of Y relative to its conditional
distribution and, similarly, C(X, Y | Z) is the conditional covariance of X and Y .
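Formulas (1.6) and (1.7) are easy to check by simulation. Below is a minimal Python sketch, using a hypothetical Poisson–gamma hierarchy (X ~ Gamma(2, 1) and Y | X = x ~ Poisson(x)), for which E(Y | X) = V(Y | X) = X:

import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Hypothetical hierarchical model: X ~ Gamma(2, 1), Y | X = x ~ Poisson(x)
x = rng.gamma(shape=2.0, scale=1.0, size=n)
y = rng.poisson(x)

# (1.6): E(Y) = E{E(Y | X)} = E(X) = 2
print(y.mean(), x.mean())

# (1.7): V(Y) = V{E(Y | X)} + E{V(Y | X)} = V(X) + E(X) = 2 + 2 = 4
print(y.var(), x.var() + x.mean())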
As a final word about conditional distributions, it is worth mentioning that conditional
distributions are particularly useful in the specification of complex models. Indeed, it can be
difficult to specify a meaningful joint distribution for a collection of random variables in a given
application. However, it is often possible to write down a series of conditional distributions that,
together, specify a meaningful joint distribution. That is, we can simplify the modeling step by
working with several lower-dimensional conditional distributions. This is particularly useful for
specifying prior distributions for unknown parameters in a Bayesian analysis; we will discuss
this more later.
Jensen’s inequality
Convex sets and functions appear quite frequently in statistics and probability, so it helps to see
some applications. The first result, relating the expectation of a convex function to the function
of the expectation, should be familiar.
Theorem 1.6 (Jensen’s inequality). Suppose ϕ is a convex function on an open interval 𝒳 ⊆ R,
and X is a random variable taking values in 𝒳. Then
ϕ[E(X)] ≤ E[ϕ(X)].
If ϕ is strictly convex, then equality holds if and only if X is constant.
Proof. Fix any point x0 in 𝒳. Since ϕ is convex, there exists a linear function
ℓ(x) = c(x − x0) + ϕ(x0), through the point (x0, ϕ(x0)), such that ℓ(x) ≤ ϕ(x) for all x. To prove
the claim, take x0 = E(X), and note that
ϕ(X) ≥ c[X − E(X)] + ϕ[E(X)].
Taking expectations on both sides gives the result.
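A quick Monte Carlo sanity check of Theorem 1.6, with ϕ(z) = e^z and a uniform X (both choices are arbitrary, and numpy is assumed available):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=10**6)   # any random variable on an open interval will do

# Convex ϕ(z) = e^z: ϕ[E(X)] should not exceed E[ϕ(X)]
print(np.exp(x.mean()), np.mean(np.exp(x)))   # ~2.72 versus ~3.19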
Jensen’s inequality can be used to confirm, for example, that E(1/X) ≥ 1/E(X) and
E(log X) ≤ log E(X) for a positive random variable X, and that E(X²) ≥ {E(X)}² in general. An
interesting consequence is the following.
Example 1.7 (Kullback–Leibler divergence). Let f and g be two probability density functions
dominated by a σ-finite measure µ. The Kullback–Leibler divergence of g from f is defined as
Ef{log[f(X)/g(X)]} = ∫ log(f/g) f dµ.
It follows from Jensen’s inequality that
Ef{log[f(X)/g(X)]} = −Ef{log[g(X)/f(X)]}
≥ −log Ef[g(X)/f(X)]
= −log ∫ (g/f) f dµ = −log ∫ g dµ = −log 1 = 0.
That is, the Kullback–Leibler divergence is non-negative for all f and g. Moreover, it equals zero
if and only if f = g (µ-almost everywhere). Therefore, the Kullback–Leibler divergence acts like
a distance measure between two density functions. While it is not a metric in the mathematical
sense, it has a lot of statistical applications. See Exercise 23.
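A numerical illustration of the non-negativity just derived: the sketch below (assuming scipy is available; the two normal densities are an arbitrary choice) computes ∫ log(f/g) f dµ by quadrature:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl(f, g, lo=-10.0, hi=10.0):
    # Kullback–Leibler divergence ∫ log(f/g) f dµ, by numerical integration
    return quad(lambda x: f(x) * np.log(f(x) / g(x)), lo, hi)[0]

f = norm(0, 1).pdf
g = norm(1, 1).pdf

print(kl(f, g))   # ~0.5, strictly positive since f ≠ g
print(kl(f, f))   # ~0, since the two densities coincide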
Example 1.8 (Another proof of Cauchy–Schwarz). Recall that f² and g² are µ-measurable
functions. If ∫ g² dµ is infinite, then there is nothing to prove, so suppose otherwise. Then
p = g² / ∫ g² dµ is a probability density with respect to µ. Moreover,
(∫ fg dµ / ∫ g² dµ)² = (∫ (f/g) p dµ)² ≤ ∫ (f/g)² p dµ = ∫ f² dµ / ∫ g² dµ,
where the inequality follows from Theorem 1.6 applied to the convex function x ↦ x².
Rearranging terms, one gets
(∫ fg dµ)² ≤ ∫ f² dµ · ∫ g² dµ,
which is the desired result.
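The inequality itself is easy to verify numerically for particular choices of f and g; here is a small sketch with µ taken to be Lebesgue measure on [0, 1] and two arbitrary functions (scipy assumed available):

import numpy as np
from scipy.integrate import quad

# Two arbitrary square-integrable functions on [0, 1], with µ = Lebesgue measure
f = lambda x: np.sin(3 * x) + 1.0
g = lambda x: np.exp(-x)

lhs = quad(lambda x: f(x) * g(x), 0, 1)[0] ** 2
rhs = quad(lambda x: f(x) ** 2, 0, 1)[0] * quad(lambda x: g(x) ** 2, 0, 1)[0]
print(lhs, rhs, lhs <= rhs)   # (∫ fg dµ)² ≤ ∫ f² dµ · ∫ g² dµ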
Another application of convexity and Jensen’s inequality will come up in the decision-theoretic
context to be discussed later. In particular, when the loss function is convex, it will follow from
Jensen’s inequality that randomized decision rules are inadmissible and, hence, can be ignored.
A concentration inequality
We know that sample means of iid random variables, for large sample sizes, will “concentrate”
around the population mean. A concentration inequality gives a bound on the probability that the
sample mean is outside a neighborhood of the population mean. Chebyshev’s inequality
(Exercise 25) is one example of a concentration inequality and, often, these tools are the key to
proving limit theorems and even some finite-sample results in statistics and machine learning.
Here we prove a famous but relatively simple concentration inequality for sums of independent
bounded random variables. By “bounded random variables” we mean Xi such that P(ai ≤ Xi ≤
bi) = 1. For one thing, boundedness implies existence of moment generating functions. We start
with a simple result for one bounded random variable with mean zero; the proof uses some
properties of convex functions. Portions of what follows are based on notes prepared by Larry
Wasserman.
Lemma 1.1. Let X be a random variable with mean zero, bounded within the interval [a, b].
Then the moment generating function MX(t) = E(e^{tX}) satisfies
MX(t) ≤ e^{t²(b−a)²/8}.
Proof. Write X = (1 − W)a + Wb, where W = (X − a)/(b − a) ∈ [0, 1]. The function z ↦ e^{tz} is
convex, so we get
e^{tX} ≤ (1 − W)e^{ta} + W e^{tb}.
Taking expectations, and using the fact that E(X) = 0 implies E(W) = −a/(b − a), gives
MX(t) ≤ (b/(b − a)) e^{ta} + (−a/(b − a)) e^{tb}.
The right-hand side can be rewritten as e^{h(ζ)}, where
ζ = t(b − a) > 0,  h(z) = −cz + log(1 − c + ce^z),  c = −a/(b − a) ∈ (0, 1).
Obviously, h(0) = 0; similarly, h′(z) = −c + ce^z/(1 − c + ce^z), so h′(0) = 0. Also,
h″(z) = c(1 − c)e^z / (1 − c + ce^z)²,  h‴(z) = c(1 − c)e^z(1 − c − ce^z) / (1 − c + ce^z)³.
It is easy to verify that h‴(z) = 0 iff z = log[(1 − c)/c]. Plugging this value of z into h″ gives
1/4, and this is the global maximum, so h″(z) ≤ 1/4 for all z. Now, a second-order Taylor
expansion of h(ζ) around 0 gives, for some z0 ∈ (0, ζ),
h(ζ) = h(0) + h′(0)ζ + h″(z0) ζ²/2 ≤ ζ²/8 = t²(b − a)²/8.
Plug this bound in to get MX(t) ≤ e^{h(ζ)} ≤ e^{t²(b−a)²/8}.
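The bound in Lemma 1.1 can be checked numerically. The sketch below uses X uniform on [−1, 1] (so E(X) = 0 and (b − a)²/8 = 1/2) and compares a Monte Carlo estimate of MX(t) with e^{t²/2} over a few values of t; the particular distribution and t values are arbitrary:

import numpy as np

rng = np.random.default_rng(2)
a, b = -1.0, 1.0
x = rng.uniform(a, b, size=10**6)        # mean-zero, bounded in [a, b]

for t in [0.5, 1.0, 2.0, 5.0]:
    mgf = np.mean(np.exp(t * x))         # Monte Carlo estimate of M_X(t)
    bound = np.exp(t**2 * (b - a)**2 / 8)
    print(t, mgf, bound, mgf <= bound)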
Lemma 1.2 (Chernoff). For any random variable X, P(X > ε) ≤ inf_{t>0} e^{−tε} E(e^{tX}).
Proof. See Exercise 26.
Now we are ready for the main result, Hoeffding’s inequality. The proof combines the results in
the two previous lemmas.
Theorem 1.7 (Hoeffding’s inequality). Let Y1, Y2, . . . be independent random variables, with
P(a ≤ Yi ≤ b) = 1 and mean µ. Then
P(|Ȳn − µ| > ε) ≤ 2e^{−2nε²/(b−a)²}.
Proof. We can take µ = 0, without loss of generality, by working with Xi = Yi − µ. Of course, Xi
is still bounded, and the length of the bounding interval is still b − a. Write
P(|X̄n| > ε) = P(X̄n > ε) + P(−X̄n > ε).
Start with the first term on the right-hand side. Using Lemma 1.2,
P(X̄n > ε) = P(X1 + · · · + Xn > nε) ≤ inf_{t>0} e^{−tnε} MX(t)^n,
where MX(t) is the moment generating function of X1. By Lemma 1.1, we have
P(X̄n > ε) ≤ inf_{t>0} e^{−tnε} e^{nt²(b−a)²/8}.
The minimizer, over t > 0, of the right-hand side is t = 4ε/(b − a)², so we get
P(X̄n > ε) ≤ e^{−2nε²/(b−a)²}.
To complete the proof, apply the same argument to P(−X̄n > ε), obtain the same bound as
above, then sum the two bounds together.
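To see how conservative the bound is, here is a small simulation (Bernoulli(0.5) samples with n = 100 and ε = 0.1; all of these choices are arbitrary) comparing the empirical tail frequency of |Ȳn − µ| with the Hoeffding bound:

import numpy as np

rng = np.random.default_rng(3)
n, eps, reps = 100, 0.1, 10**5
mu, a, b = 0.5, 0.0, 1.0

# reps independent sample means of n iid Bernoulli(0.5) variables, bounded in [0, 1]
y_bar = rng.binomial(n, mu, size=reps) / n

empirical = np.mean(np.abs(y_bar - mu) > eps)
bound = 2 * np.exp(-2 * n * eps**2 / (b - a)**2)
print(empirical, bound)   # roughly 0.035 versus the bound 2e^{-2} ≈ 0.27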
There are lots of other kinds of concentration inequalities, most of which are more general than
Hoeffding’s inequality above. Exercise 28 walks you through a concentration inequality for
normal random variables and a corresponding strong law. Modern work on concentration
inequalities deals with more advanced kinds of random quantities, e.g., random functions or
stochastic processes. The next subsection gives a special case of such a result.
