
MONTE CARLO METHODS AND APPLICATIONS
Carnegie Mellon University | 21-387 | 15-327/627 | 15-860 | Fall 2023

LECTURE 6: SAMPLE GENERATION
“Jedenfalls bin ich überzeugt, daß der nicht würfelt.” (In any case, I am
convinced that [God] does not roll the dice.)
—ALBERT EINSTEIN, in a letter to Max Born (1926)

“God does not play dice with the universe, but I will.”
—WALTER MATTHAU, playing Einstein in the 1994 romantic comedy “I.Q.”
Overview
• Some quick comments about numerics & computation
• Random number generation
• Low-discrepancy sequences
• Sampling special distributions
• General techniques for discrete distributions
• General techniques for continuous distributions
– inversion
– rejection sampling
Numerics & Computation
Just in case you forgot… :-) the basic Monte Carlo estimator:

  Î_N := (1/N) ∑_{i=1}^N f(X_i)/p(X_i),   X_i ∼ p

Taking Large Sums
• Easy task for a computer: summing up a bunch of numbers.
• Let’s give it a try: consider the harmonic series 1 + 1/2 + 1/3 + ⋯
• Approximate by taking first N terms, for large N
• Seems to converge to about 15.4037
• Q: Anyone know what it should converge to? 😅
• A: Infinity! (diverges)
• What went wrong?

float sum = 0.f;
for( int k = 1; k <= N; k++ )
{
   sum += 1.f/(float)k;
}

   N      harmonic series partial sum
   1      1
   10     2.92897
   100    5.18738
   1k     7.48548
   10k    9.78761
   100k   12.0909
   1M     14.3574
   10M    15.4037
   100M   15.4037
   1B     15.4037
Taking Large Sums—Part II
• Easy task for a computer: summing up a bunch of
numbers.
• Suppose we now have a convergent sum, and want
to compute it in parallel on a GPU.
• The first phase is easy: if we have N terms in our
sum and M threads, just ask each thread to evaluate
N/M terms and add them up.
• On a recent GPU, M can be something like 160,000.
🤯
• So now we have 160k partial sums stored in
different places. How do we sum them up without
negating the benefit of running the first phase in
parallel?
Taking Large Products
• Easy task for a computer: multiplying a big list of numbers
• Example. Let a = 10, b = 1/10
• Q: What do you get if you multiply abababab… 500 times?
  – A: 1.0
• Q: What do you get if you multiply aaaaa… 500 times, then bbbbb… 500 times?
  – A: NaN (What does this even mean…?)

P = 1
for i in range(500):
   P *= a*b
print( P )
# output: 1.0

P = 1
for i in range(500):
   P *= a
for i in range(500):
   P *= b
print( P )
# output: NaN
Numerical & Computational Issues
• In practice, lots to consider beyond formal proofs of convergence
• Floating point
  – computers don’t actually operate on real numbers!
• Parallel implementation
  – basic Monte Carlo is “embarrassingly parallel”
  – but, can get tripped up by behavior of integrand
• In general, relationship between theory & practice is a two-way street
  – Practical obstacles may require changes to mathematical formulation
  – E.g., different sampling strategies may be better suited to different
    architectures (CPU vs. GPU vs. distributed over network vs. …)

float sum = 0.f;
for( int k = 1; k <= n; k++ )
{
   sum += 1.f/(float)k;
}
Floating Point Numbers
• How do we approximate real number computation so that it’s fast & accurate?
• Several solutions: fixed point, floating point, rational arithmetic, …
• Floating point by far the most common for Monte Carlo methods
  – fixed number of bits of storage (32/64 = “single/double precision”)
  – one bit for the sign, the exponent (b) takes some bits, and the leading
    term/“mantissa” (a) takes the rest:

    x = ±a · 2^b

• Of course, this represents only a small subset of the reals
  [figure: representable floating point numbers on the real line, and in 2D]
Errors in Floating Point
• Sequences of arithmetic operations can lose information/accuracy
• Can also go out of range
• Many common misconceptions, e.g.:
  – “dividing by small numbers is bad for accuracy” — simply not true!
• Division/multiplication ⟹ overflow/underflow
• Addition/subtraction ⟹ cancellation error (bits in smaller term ignored)
  [figure: adding a small term to a large running total — bits below the
  smallest number representable with the same exponent as the running total
  are ignored]
• Often have to be more careful about addition/subtraction
  – Monte Carlo is a big sum!
Compensated Summation
• A variety of ways to deal with cancellation in large sums
• One simple strategy: compensated summation
• Key idea: pile up “little bits” until they’re big enough to add to the sum.

float sum = 0.f; // running total
float c = 0.f;   // cancellation error from most recent addition
for( int k = 1; k <= n; k++ )
{
   float y = (1.f/(float)k) - c; // subtract previous error from this term
   float t = sum + y;            // compute a temporary sum
   c = (t - sum) - y;            // compute the cancellation error
   sum = t;                      // set the new sum to the temporary sum
}
Harmonic Series—revisited
• Let’s try our harmonic series again, this time with compensated summation
• This time, the partial sum keeps increasing, as expected 😮💨

float sum = 0.f; // running total
float c = 0.f;   // cancellation error from most recent addition
for( int k = 1; k <= n; k++ )
{
   float y = (1.f/(float)k) - c; // subtract previous error from this term
   float t = sum + y;            // compute a temporary sum
   c = (t - sum) - y;            // compute the cancellation error
   sum = t;                      // set the new sum to the temporary sum
}

   N      harmonic series partial sum
   1      1
   10     2.92897
   100    5.18738
   1k     7.48548
   10k    9.78761
   100k   12.0909
   1M     14.3927
   10M    16.6953
   100M   18.9978
   1B     21.3004
Log-sum-exp trick
• Now that we have a handle on mitigating cancellation error in large sums,
  how do we avoid underflow/overflow in large products?
• One way: carefully order multiplication so you tend to multiply “small
  terms” with “large terms”
  – e.g., sort list of terms, multiply pairs with roughly reciprocal
    magnitudes, repeat…
  – that’s a lot of effort—and only works if we know all our terms ahead of time!
• Easier way: log-sum-exp
  – take logarithm of each term
  – sum logarithms
  – exponentiate final sum
• Might lose a little accuracy—but much better than failure!

P = 1
for i in range(500):
   P *= a
for i in range(500):
   P *= b
print( P )
# output: NaN

S = 0
for i in range(500):
   S += log(a)
for i in range(500):
   S += log(b)
print( exp(S) )
# output: 1.0000000000001608
Parallel Implementation—Challenges
• At first glance, Monte Carlo seems “embarrassingly parallel”: just ask each thread/
core/node/etc., to evaluate its fair share of the terms in the sum.
• Several problems, though:

1. (Easier.) The partial sums have to be collected and aggregated into the final sum,
which means there’s some communication overhead.
2. (Harder.) Each summand may take a very different amount of work, or use
divergent memory accesses, making load balancing and cache management
challenging.
3. (Also harder.) Distributing MCMC doesn’t help with mixing time. 😞

• Some well-established solutions for, e.g., parallel sum/parallel prefix sum… the other
problems are much harder! (And problem-specific.)
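To make the “easier” problem above concrete, here is a minimal illustrative sketch (not from the slides) of combining partial sums by pairwise tree reduction, the same pattern a GPU reduction kernel follows; the function name tree_reduce and the serial loop standing in for parallel rounds are my own:

def tree_reduce(partial_sums):
    # combine M partial sums in O(log M) pairwise rounds;
    # each round could run in parallel on a GPU
    sums = list(partial_sums)
    while len(sums) > 1:
        paired = [sums[i] + sums[i+1] for i in range(0, len(sums) - 1, 2)]
        if len(sums) % 2 == 1:   # odd element carries over to the next round
            paired.append(sums[-1])
        sums = paired
    return sums[0]

# e.g., 8 threads have each produced a partial sum
print(tree_reduce([1.0, 0.5, 0.33, 0.25, 0.2, 0.17, 0.14, 0.125]))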
Random Number Generation
Random Number Generation—Overview
• Our most basic task will be to sample uniform random numbers
– canonically on interval [0,1]; can be transformed to other distributions
• Lots of questions:
– what does “random” really mean?
– do random numbers really exist?
– how can we simulate numbers that appear random?
– how do we measure/test quality of random numbers?
– what features of randomness are important for Monte Carlo?
• Will try to at least get a brief glimpse at some answers!
How do you generate (good) random numbers?

(This is probably not the way.)


Is this a good random number?

47,692

Randomness is a property of collections of numbers…


(discussion cribbed from Avi Wigderson)
Is a coin toss random?
• Suppose I have a coin that can land “heads” or
“tails” with equal probability
• Q: At best, how often can you predict which way
it will land?
• A: 50% of the time (just always guess “heads”)
• Q: Suppose I now let you take as many
measurements as you want right before I flip it,
and do as much computation as you want?
• A: Probably around 100% of the time…
Key idea: randomness is not an objective property—
depends on how much information and
computational power you have.
Randomness & Determinism
One perspective: if all phenomena in the universe follow deterministic
laws (e.g., Newton’s laws of motion) then with enough computational
power you can always predict exactly what will happen.

If everything is predictable, can anything be truly “random”?


Quantum Indeterminacy & Randomness
Another perspective: aren’t some (e.g., quantum mechanical) phenomena in
the universe inherently random?

…maybe? But what if I have quantum information & quantum computation?


Answer: I don’t know—ask Ryan O’Donnell!
Honestly, this stuff does not matter for Monte Carlo*.

We don’t need to simulate “perfect” random numbers!

We merely want to take advantage of useful properties


of random numbers.

(But which ones?)

*…but if you like this kind of thing, there’s a great Abel Prize lecture by Avi Wigderson
Good vs. Bad Random Numbers
• Does the following list of numbers seem like a “good” or “bad” random
sequence for use in Monte Carlo methods?
– 2, 5, 8, 31, 32, 36, 54, 56, 59, 60, 68, 71, 72, 78, 82, 89
• How about this one—is this a “better” or “worse” sequence?
– 31, 36, 8, 5, 82, 89, 72, 60, 54, 2, 71, 32, 56, 78, 68, 59
• If we use these numbers to generate sample points for Monte Carlo
  integration, does order even matter?
  – or just distribution over the sample space?
• Key idea: “quality” of random numbers depends on what you want to use
  them for!
Good vs. Bad Random Numbers
Are these “good” random numbers / points?

How can you tell which ones are best suited for Monte Carlo?
“Deterministic random numbers…?”

“Any one who considers arithmetical methods of producing


random digits is, of course, in a state of sin.”

— JOHN VON NEUMANN (1951)


Pseudorandom Number Generation
• A pseudorandom number generator (PRNG) is a deterministic algorithm for
  producing a sequence of numbers x1, x2, … that “behave like” uniform
  random numbers
  – e.g., iterative application of function F(x) to initial “seed” x1
    (seed is user-provided, or based on time, “entropy” like cursor motion…)
• Many criteria we might care about for Monte Carlo methods:
  – statistical randomness — want algorithms to match theory
  – long periods — once a PRNG repeats, subsequent integration is useless!
    (Note: all PRNGs have a finite period, because they have finite internal state!)
  – cheap to compute — we need a lot of them!
  – parallel “streams” — doesn’t help if all threads compute same thing!
  – repeatability — especially useful for debugging :-)
  – computationally hard to predict — that’s cryptography, not Monte Carlo!
Measures of Randomness
• Many possible ways to measure “randomness”
• E.g., statistical tests:
  – (1969) Knuth — Art of Computer Programming
  – (1996) DIEHARD battery of tests
  – (2007) TestU01 library
• Basically checks whether PRNG agrees with established (and usually
  arcane…) facts about behavior of random numbers
  – e.g., gaps between random numbers on an interval should follow
    exponential distribution
  – e.g., number of random points that land in regular grid cells should
    follow multinomial distribution
  [figure: point sets from a “good” PRNG vs. a “bad” PRNG]
The dark past of random number generators…

“…in the late 1960s, a truly horrible random number generator called RANDU
was commonly used on most of the world’s computers…”
—Donald Knuth, The Art of Computer Programming (Vol. 2)

“Monte Carlo simulations on [a model] for which the exact answers are known
have shown that ostensibly high-quality random number generators may lead to
subtle, but dramatic systematic errors … these results offer a stern warning
… a specific algorithm must be tested together with the random number
generator being used, regardless of the tests which the generator has passed.”
—Ferrenberg, Landau, & Wong (1992)
First Random Number Generator
• Earliest PRNG was perhaps suggested by John von Neumann:
  – square the current number; take the middle 10 digits; repeat
• At first glance, looks ok… [figure: distribution of digits; pairs as points]
• Big problem: when we hit “0”, just repeats “0” forever!
• More generally: many ad-hoc schemes get stuck in short cycles
• Wisdom: when designing a random number generator, don’t just try
  something at random!
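As a quick illustration (my own sketch, using 4-digit numbers instead of the 10 digits mentioned above), von Neumann’s middle-square rule is easy to write down, and just as easy to watch get stuck:

# middle-square PRNG sketch: square the state, keep the middle 4 digits
def middle_square(seed, count):
    x = seed
    sequence = []
    for _ in range(count):
        x = (x * x // 100) % 10000   # drop the last 2 digits, keep the middle 4
        sequence.append(x)
    return sequence

print(middle_square(1234, 8))   # wanders for a while...
print(middle_square(2500, 4))   # ...but 2500 maps straight back to 2500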
Linear Congruential Generators (LCG)
• A linear congruential generator is a sequence of numbers obtained by
  iteratively applying a linear map & taking the modulus
  – x_{k+1} = mod(a x_k + b, m)
  – a, b, m are integers, with a, b < m
• Surprisingly easy and common way to generate “ok but not great” quality
  pseudorandom numbers (e.g., used historically for rand() in many
  mainstream systems—like the gcc C++ compiler!)
  – need to carefully choose a, b, m
  – various conditions (e.g., Hull–Dobell theorem) ensure good behavior
• Can you tell which of the two below is a “high quality” PRNG?
  [figure: point plots for an LCG with “bad” parameters a = 123, b = 456,
  m = 123456 (repeats after 320 points!); an LCG with a = 48271, b = 0,
  m = 2^31 − 1 (provided in C++11!); and the Mersenne Twister]
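As a minimal sketch (illustration only, not a library-quality generator), here is an LCG using the a = 48271, b = 0, m = 2^31 − 1 parameters listed above:

# x_{k+1} = (a*x_k + b) mod m, mapped to [0,1) by dividing by m
class LCG:
    def __init__(self, seed=1, a=48271, b=0, m=2**31 - 1):
        self.x, self.a, self.b, self.m = seed, a, b, m
    def next_int(self):
        self.x = (self.a * self.x + self.b) % self.m
        return self.x
    def next_float(self):
        return self.next_int() / self.m

g = LCG(seed=42)
print([round(g.next_float(), 4) for _ in range(5)])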
Digits of π
• Already used random numbers to compute π—how about using π to compute
  random numbers? 😁
• Conjecture: π is a “normal number,” i.e., for every k ≥ 1, every string of
  digits of length k appears with equal frequency (in any base b)
• Bonus: infinite period/never repeats (π is irrational)
  – caveat: must still store index of digit (finite memory)
• Downsides:
  – (much) more expensive than other PRNGs
  – loading precomputed values is not cache friendly
  – fun idea, but there are better ways…
  [figure: digits of π]
Mersenne Twister
• Mersenne Twister often cited as a “good” PRNG since:
  – it’s found in many popular languages (Python, C++, R, PHP, …)
  – it has a cool name
• Q: Ok, but how does it work?
• A: There’s a truly marvelous description, which this box is too small to contain
• On the other hand:
  – it’s not particularly fast
  – it fails more recent statistical tests of randomness (e.g., some from TestU01)
  – for unlucky seeds, can be a long “burn in” time before achieving good
    statistical randomness
  – bad for parallel Monte Carlo, since using different seeds on each thread
    still yields correlation
See: Vigna, “It Is High Time We Let Go Of The Mersenne Twister” (2019)
Permuted Congruential Generator
• A permuted congruential generator (PCG) is the most recent hope for a
  “good” PRNG
  – passes all the statistical tests
  – fast, long periods, works well in parallel…
• How does it work?
  – Basically just an LCG! mod(ax + b, m)
  – But with a twist: ordinary LCG exhibits short period in low-order bits.
    So, use high-order bits to “shuffle” low-order bits
  – shuffling means things like XOR or rotating bits
• Python: numpy.random.PCG64

complete implementation (C++):

#include <stdint.h>

uint64_t state = 0x4d595df4d0f33173;
uint64_t const multiplier = 6364136223846793005u;
uint64_t const increment = 1442695040888963407u;

uint32_t rotr32(uint32_t x, unsigned r) {
   return x >> r | x << (-r & 31);
}

uint32_t pcg32(void) {
   uint64_t x = state;
   unsigned count = (unsigned)(x >> 59);
   state = x * multiplier + increment;
   x ^= x >> 18;
   return rotr32((uint32_t)(x >> 27), count);
}

void pcg32_init(uint64_t seed) {
   state = seed + increment;
   (void)pcg32();
}
Comparison of PRNGs
From https://www.pcg-random.org/

Take with a grain of salt—“marketing material” for PCG!


Can AI do better?
• A: I don’t know… maybe?
• Here’s what I get if I ask ChatGPT for random numbers:
  “Please give me 64 two-dimensional points sampled from the uniform
  distribution on the unit square.”
Stratified Random Numbers
• We discussed stratified sampling as method of variance reduction
• Easy win: why not “bake in” (1D) stratification to random number
  generator itself?
• Caveat: to avoid bias, must always request & use numbers in batches of a
  known size
  – e.g., initialize PRNG with a maximum sample size, and stop stratifying
    after this number has been reached
• Not something you’d ever do for cryptographic applications… great for
  Monte Carlo! :-)
  [figure: stratified samples of f(x) on an interval]
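One possible way to “bake in” 1D stratification, as a hedged sketch: the sampler below assumes the batch size N is known up front and hands out one jittered sample per stratum (the class name and the use of Python’s random module are my own choices):

import random

class StratifiedSampler:
    # the i-th of N samples is uniform on the stratum [i/N, (i+1)/N);
    # only unbiased if the whole batch of N samples is actually used
    def __init__(self, N):
        self.N, self.i = N, 0
    def next(self):
        u = (self.i + random.random()) / self.N
        self.i = (self.i + 1) % self.N   # wrap after N draws
        return u

s = StratifiedSampler(8)
print(sorted(s.next() for _ in range(8)))   # one sample in each of 8 strata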
Low-Discrepancy Sequences
Low-Discrepancy Sequences
• Also discussed quasi-Monte Carlo when discussing variance reduction
• Here, replace random numbers with low-discrepancy sequences
– e.g., Halton or Hammersley sequences
• Algorithmically, generation of low-discrepancy points not much different
from pseudorandom number generators (even similar cost)
– goal is different: achieve low discrepancy, rather than modeling uniform
distribution
• How do we actually generate these points algorithmically?
(see Owen chapter on Quasi Monte Carlo for more in-depth treatment)
Review: Discrepancy
• What makes a set of sample points P_N := {p1, …, pN} “good”?
• One feature we might like: number of points covered by a “random” region
  is proportional to the area of the region.
• Several ways to formalize—e.g., for points P in the unit cube C_d := [0,1]^d,

  D(P_N) := sup_{B∈J} | #(B ∩ P_N)/N − vol(B) |

  for some family of subsets J of C_d
  – e.g., J could be all intervals, half intervals, …
  [figure: a box B covering some of the points p_i in the unit square]
Low Discrepancy Sequences—Warm Up
• How can we deterministically generate “well-spaced” points on [0,1]?
• One idea: just subdivide [0,1] into equal size intervals (midpoint rule!)
  – unfortunately not progressive…
• Instead, suppose we recursively subdivide the interval into pieces:
  “fill in the largest remaining gap”
  [figure: insertion order 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
• How do we do this algorithmically?
Radical Inverse
• Many low-discrepancy sequences built up using radical inverse ϕ_b(i), also
  known (more or less…) as the van der Corput sequence
  – pick a base b, “reflect” digits of integer i across the decimal place
  – more precisely, if i = ∑_{k=0}^∞ i_k b^k, then ϕ_b(i) := ∑_{k=0}^∞ i_k b^{−1−k}
• Example. Consider integer i = 6. In base 2 (binary), i has digits 110, i.e.,
  two plus four. Radical inverse in binary is 0.011. In decimal, then, the
  base 2 radical inverse of 6 is one fourth plus one eighth, or 1/4 + 1/8 = 3/8
  [figure: digits 110 with place values 4, 2, 1 reflected by ϕ_b into 0.011
  with place values 1/2, 1/4, 1/8]
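A short sketch of the radical inverse (the function name is mine), which reproduces the worked example above, ϕ₂(6) = 3/8:

def radical_inverse(i, b):
    # mirror the base-b digits of i across the radix point
    inv, f = 0.0, 1.0 / b
    while i > 0:
        inv += (i % b) * f   # digit i_k contributes i_k * b^(-1-k)
        i //= b
        f /= b
    return inv

print(radical_inverse(6, 2))   # 0.375 = 3/8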
Halton Sequence
• Halton sequence is sequence of d-dimensional points
  x_i := (ϕ_{b1}(i), …, ϕ_{bd}(i)) where bases b_k are relatively prime
  (avoids correlation between dimensions)
  – most typically: b1, …, bd are just the first d primes
  – e.g., in 2D x_i = (ϕ_2(i), ϕ_3(i))
  [figure: 2D point sets for bases (2,2), (2,3), (2,4), (2,5)]
• Halton sequences are naturally stratified
  – E.g., in 2D if we generate N = 2^a 3^b Halton points, they will fall into
    x-intervals of size 1/2^a and y-intervals of size 1/3^b.
  [figure: stratification for (a,b) = (2,1), (3,2), (5,3)]
Hammersley Sequence
• Hammersley sequence also builds low-discrepancy sample points x_1, …, x_n
  from the radical inverse
  – in particular, x_{i1} := (i − 1)/n, and x_{ij} := ϕ_{p_{j−1}}(i − 1) for
    the remaining j, where p_j is the jth prime number
  – achieves lower discrepancy than Halton
  – …but, requires that total number of samples n be fixed in advance
  [figure: Halton vs. Hammersley point sets]
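A small self-contained sketch of 2D Halton and Hammersley points built from the radical inverse (repeated here so the snippet stands alone; the function names are my own):

def radical_inverse(i, b):
    inv, f = 0.0, 1.0 / b
    while i > 0:
        inv += (i % b) * f
        i //= b
        f /= b
    return inv

def halton_2d(n):
    # i-th point uses the relatively prime bases 2 and 3
    return [(radical_inverse(i, 2), radical_inverse(i, 3)) for i in range(1, n + 1)]

def hammersley_2d(n):
    # first coordinate (i-1)/n, so the total count n must be fixed in advance
    return [((i - 1) / n, radical_inverse(i - 1, 2)) for i in range(1, n + 1)]

print(halton_2d(4))
print(hammersley_2d(4))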
Special Distributions
Special Distributions—Overview
• Many continuous distributions beyond uniform come up repeatedly:
– normal, exponential, …
– or uniform distribution on other domains (sphere, simplex, …)
• Well-known formulas for sampling from these
• Basic approach:
– sample uniform random numbers
– apply closed-form expression that “warps” uniform to nonuniform
• Can more generally apply warping to any source of randomness
– stratified, low-discrepancy, …
• Also need to sample discrete distributions (integers, permutations…)
Sampling Integers
• Q: Suppose I have a random number generator rand() that provides integers
  uniformly sampled from the interval [0, M]. How can I get a random integer
  in the range [0, a), where a < M?
• A: Just compute rand()%a. — wrong
• No! “Obvious” strategy doesn’t sample uniformly.
• Q: Why not? :-)
• A: Consider the case M = 10, a = 7…
  0%7 = 0   5%7 = 5
  1%7 = 1   6%7 = 6
  2%7 = 2   7%7 = 0
  3%7 = 3   8%7 = 1
  4%7 = 4   9%7 = 2
  (0, 1, 2 appear twice as often!)
• Q: How can we get it right? (two-word answer)
• A: Rejection sampling. — right
  (Or just call a (trusted!) library function.)

# uniformly sample integer from the range [0,a-1],
# assuming rand() gives uniform values in [0,M]
def uniformRand( a ):
   m = a * floor(M/a)
   while True:
      x = rand()
      if x < m:
         return x % a
Sampling Real Values
• As you might imagine, you also have to
think carefully about numerical issues
when sampling other distributions (e.g.,
floating point values in [0,1]).
• Lots of stuff written on how to do it
right…
• Rule of thumb: trust, but verify!

“The conventional method for generating pseudorandom floating point values is to divide a
pseudorandom integer by a floating-point constant. The problems with this approach are (1) the
possible outcomes are only about 7% of representable floating point values (2) subsequent
transformations may not yield the expected distributions (e.g., applying a log transformation to a
uniform distribution does not yield an exponential distribution).”
Sampling a Random Permutation
• For many algorithms, need to sample random permutations
  – e.g., Latin hypercube sampling
• Recall that a permutation of a discrete set Ω is a bijection η : Ω → Ω
  – intuitively, a “shuffling” of the elements
  [figure: a permutation η of Ω = {1, 2, 3, 4, 5}]
• For a set with n elements, there are n! distinct permutations
• Q: How do we sample permutations uniformly at random, i.e., such that we
  have a 1/n! chance of sampling each distinct permutation?
• Short answer: draw “tiles” 1–n randomly from a bag until none remain!
Slow Shuffle
• One algorithm: starting with the list Ω = (1,…, n), pick a number k
  uniformly from 1,…, |Ω|; remove the kth element from Ω; repeat until Ω
  is empty.
  [figure: removing elements of (1, 2, 3, 4, 5) at k = 3, 2, 3, 1, 1 yields
  the permutation (3, 2, 5, 1, 4)]
• Q: How much does this algorithm cost?
• A: To locate/remove the kth element takes O(n) time; we repeat this
  process n times ⟹ O(n²)
  – same cost for array, linked list, …
• Q: How can we do better?
Fisher-Yates Shuffle
• Fisher-Yates shuffle*:
  – start with list Ω = (1,…, n)
  – iterate i from left to right
  – swap element i with element j uniformly sampled from i, …, n
  [figure: successive swaps turn (1, 2, 3, 4, 5) into (4, 3, 5, 1, 2)]
• Q: What’s the cost?
• A: O(n): one swap, one call to PRNG per iteration
• Q: Why does this generate uniformly sampled permutations?
• A: Each time, we uniformly sample from the remaining values (same as
  “slow shuffle”)
*Warning: assuming 1-based indexing!
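A minimal 0-indexed Python version of the shuffle described above (the slide’s description assumes 1-based indexing):

import random

def fisher_yates(items):
    # O(n): one swap and one PRNG call per element
    a = list(items)
    n = len(a)
    for i in range(n - 1):
        j = random.randint(i, n - 1)   # uniform over remaining positions i..n-1
        a[i], a[j] = a[j], a[i]
    return a

print(fisher_yates([1, 2, 3, 4, 5]))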
Sampling Continuous Distributions
One approach to generating samples from a continuous distribution:
– sample points from the uniform distribution
– apply “warping” function to get target distribution
Warping other collections of samples
Can of course apply warping to any other collection of samples:
  [figure: uniform, stratified, Halton, and blue noise samples before and
  after warping]
May or may not preserve strengths of a given sampling strategy
(e.g., points may no longer be nicely stratified after warping).
Review: Normal Distribution
A random variable X on the real line (Ω = ℝ) is normally distributed with
mean µ and standard deviation σ if its probability density function is a
Gaussian of the form

  p(x) = 1/(σ√(2π)) · exp( −(x − µ)² / (2σ²) )

(the leading factor ensures it integrates to 1; easily generalizes to n
dimensions)
Sampling from the Normal Distribution
The Box-Muller transformation provides a simple way to warp uniformly-
distributed samples (u, v) ∈ [0,1]² into normally-distributed ones:

  (x, y) = √(−2 ln v) · (cos(2πu), sin(2πu))

(each component follows a 1D normal distribution)
Box-Muller Distribution
• Box-Muller gives samples from the standard normal distribution, with mean
  μ = 0 and standard deviation σ = 1
  – to get other distributions, we can just scale & shift

def SampleNormal( μ, σ ):
   u = uniform(0,1) * 2*pi
   v = uniform(0,1)
   return σ*sqrt(-2*log(v))*cos(u) + μ

  [figure: histogram of 10k samples, for μ=5, σ=2]
Multivariate Normal Distribution
• In 2D, the standard multivariate normal distribution is given by

  p(x, y) = (1/2π) · exp( −(x² + y²)/2 )

• Q: How can we sample from it?
• A: Sample x, y from two independent 1D standard normal distributions
• Q: Why does this work?
• A: One way to see it: the marginal distribution with respect to both x and
  y is the standard 1D normal distribution:

  ∫_{−∞}^{∞} p(x, y) dy = (1/√(2π)) · exp( −x²/2 )
  ∫_{−∞}^{∞} p(x, y) dx = (1/√(2π)) · exp( −y²/2 )

• To sample a d-dimensional distribution, just sample X1, …, Xd ∼ 𝒩(0,1)
Sampling the Unit Circle
• Plenty of other standard / simple distributions we need to sample from
• Q: Easy example: how do we draw samples uniformly from the unit circle
  S¹ := {x ∈ ℝ² : ∥x∥ = 1}?
• A: Just sample a value u uniformly from [-1,1], then evaluate
  (cos(πu), sin(πu))
• Q: How do you sample from [-1,1]? :-)
• A: Sample [0,1], multiply by 2, subtract 1
Sampling a 2D disk
• How do we get uniform samples on the unit disk D² := {x ∈ ℝ² : ∥x∥ ≤ 1}?
• First idea: uniformly sample r, u from [0,1], then plug in to
  r(cos(2πu), sin(2πu))
• Doesn’t work! (nonuniform—too many samples near the origin…)
• Simple modification: √u (cos(2πv), sin(2πv)) for u, v ∼ U[0,1]
  – we’ll derive this formula later!
Sampling a d-Dimensional Sphere
• Q: How do we uniformly sample the d-dimensional sphere
  S^d := {x ∈ ℝ^{d+1} : ∥x∥ = 1}?
  – Hint: since the L2 norm √(x1² + ⋯ + x_{d+1}²) doesn’t depend on the
    choice of coordinate system, a multidimensional Gaussian e^{−∥x∥²}
    exhibits spherical symmetry
• A: Just generate d+1 normally-distributed coordinates
  x1, …, x_{d+1} ∼ 𝒩(0,1) and normalize them (y := x/∥x∥).
Sampling a d-Dimensional Ball
• First sample a point y uniformly from the d-dimensional sphere
• Then draw a value a uniformly from [0,1]
• Uniform point on the ball is then a^{1/d} y
• Food for thought: can you derive this formula (using the techniques we’ll
  see later)?
Sampling a Random Rotation
• In general, normal distribution is starting point for sampling many other
  basic distributions
• E.g., to sample a d-dimensional rotation matrix Q (QᵀQ = I, det(Q) > 0):
  – sample a d × d matrix A where each entry is sampled from the standard
    (1D) normal distribution
  – compute the spectral decomposition A = QΛQᵀ
  – then Q is a matrix uniformly sampled* from the rotation group
*with respect to the Haar measure
Beyond Basic Distributions
Lots of neat tricks we can play to sample basic distributions…
…but just like integrals, there are only so many distributions we can sample
in closed form.
How can we handle more general distributions p(x)?
Discrete Distributions
Sampling Discrete Distributions
• Consider a discrete random variable X with outcomes x1, …, xn, and
  associated probabilities
  – p1, …, pn ∈ [0,1]
  – p1 + ⋯ + pn = 1
  [figure: bar chart of probabilities 0.29, 0.22, 0.13, 0.14, 0.11, 0.11]
• How do we construct an algorithmic procedure that models this random
  variable? I.e., probability of sampling xi is pi
  – repeatedly running the procedure N times yields each outcome xi with
    fraction approaching pi as N → ∞
• Several methods
  – inversion, rejection, resampling, alias table…
Review: Cumulative Distribution
For any discrete random variable X, the cumulative distribution function
P(x) is the monotonically increasing function given by P(x) := Pr(X ≤ x);
each cumulative probability P_k is just the sum of the first k
probabilities p_i.
  [figure: a CDF rising from 0 to 1 over outcomes 2 through 12]
Computing the CDF
• Want to compute P_k := ∑_{i=1}^k p_i for k ∈ 1,…, n
• Q: What’s the asymptotic cost of evaluating all n sums directly?
• A: Have to sum one term for the 1st sum, two terms for the 2nd…
  – 1 + 2 + ⋯ + n = (n² + n)/2 ⟹ O(n²)
• Q: Can you do it faster, in O(n) time?
• A: Sure, just store the “partial sums”
  – P1 = p1, P2 = P1 + p2, …
  – known as cumulative sum/prefix sum/scan
  [figure: stacking p1, …, p6 one at a time to build P1, …, P6 = 1]
• Q: Can you do it even faster, in O(log n) time?
• A: A parallel scan can build the CDF in O(log n) time
Inverting the CDF
Suppose you’ve already computed the CDF (array with values P1, …, Pn).
Q: How do you draw samples from the discrete distribution p?
A: Two steps:
  – sample u uniformly from [0,1]
  – find the first k such that Pk > u
(Then just return event xk)
Q: Why does this work?
A: Interval [0,1] is split into pieces proportional to probabilities pk
  [figure: a uniform value u on [0,1] landing in the piece of height pk
  between Pk−1 and Pk]
Cost of Inversion Method
• Q: Given u, what’s the cost of locating the corresponding event?
– O(n): at worst, have to search whole list
• Q: How can you do better?
– use binary search O(log n) per sample
• Q: What’s the overall cost to take a single sample?
– O(n log n + log n) = O(n log n)
– But since we need only build CDF table once, cost is shared by all
subsequent samples
– In particular, for N ≫ n, approach amortized cost of O(log n) per sample
If distribution is large (n big), or need just a few samples
(N small), might want to take a different approach…
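Putting the last few slides together, a sketch of the full discrete inversion pipeline: build the CDF once with a prefix sum, then draw each sample with a binary search (the use of Python’s bisect and accumulate is my own choice):

import random
from bisect import bisect_right
from itertools import accumulate

def build_cdf(p):
    return list(accumulate(p))        # O(n) prefix sum: P_k = p_1 + ... + p_k

def sample_discrete(cdf):
    u = random.random()
    return bisect_right(cdf, u)       # first index k with P_k > u, in O(log n)

p = [0.1, 0.3, 0.4, 0.2]
cdf = build_cdf(p)
counts = [0] * len(p)
for _ in range(100000):
    counts[sample_discrete(cdf)] += 1
print([c / 100000 for c in counts])   # should approach p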
Alias Table
• Can get O(1) amortized sampling via alias table
• Construct via Robin Hood algorithm:
  – first compute the mean probability p̄ := (p1 + ⋯ + pn)/n
  – anybody above p̄ is “rich”, anyone below is “poor”, anyone equal is “fair”
  – iterate over events from left-to-right
  – then just rob from the rich, give to the poor…
  – keep track of who you robbed from
  – Note: rich can become poor!
  [figure: columns of height pk, each topped up to the mean p̄ using mass
  taken from a “rich” column]
• Key idea: since area covered by region k is proportional to pk, can just
  sample uniformly from this rectangle.
• To sample from alias table:
  – sample index k uniformly from 1, …, n
  – use biased coin flip to pick event in kth column
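For reference, a compact sketch of alias-table construction and sampling in the “rob from the rich” spirit above (this particular bookkeeping follows Vose’s variant; helper names are mine):

import random

def build_alias_table(p):
    n = len(p)
    scaled = [n * pi for pi in p]                  # compare against the mean probability
    prob, alias = [0.0] * n, [0] * n
    poor = [i for i, s in enumerate(scaled) if s < 1.0]
    rich = [i for i, s in enumerate(scaled) if s >= 1.0]
    while poor and rich:
        p_i, r_i = poor.pop(), rich.pop()
        prob[p_i], alias[p_i] = scaled[p_i], r_i   # top up a poor column from a rich one
        scaled[r_i] -= 1.0 - scaled[p_i]           # note: the rich donor can become poor!
        (poor if scaled[r_i] < 1.0 else rich).append(r_i)
    for i in poor + rich:                          # leftovers are exactly "fair"
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias):
    k = random.randrange(len(prob))                        # uniform column
    return k if random.random() < prob[k] else alias[k]    # biased coin flip

prob, alias = build_alias_table([0.1, 0.3, 0.4, 0.2])
counts = [0, 0, 0, 0]
for _ in range(100000):
    counts[sample_alias(prob, alias)] += 1
print([c / 100000 for c in counts])   # O(1) per sample, approaches the target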
Inversion
Inversion Method—Overview
• How do we get formulas for special distributions (normal, etc.)?
• Can be via continuous version of the inversion method
• Inversion method for continuous distributions:
  – start with probability distribution p(x) we want to sample from
  – construct the associated cumulative distribution P(x) via integration
  – find an explicit form for the inverse CDF, P⁻¹(y)
  – apply P⁻¹ to uniformly-distributed samples
• Basic idea is to “warp” uniform samples into nonuniform ones
Inversion Method—1D
• More explicitly, consider a probability density p : [a, b] → [0,1]
• Corresponding cumulative distribution function given by
  P(x) := ∫_a^x p(y) dy
• The random variable Y := P⁻¹(U), for U ∼ U[0,1], then has probability
  density p
  – i.e., draw samples uniformly, then apply inverse CDF
• Quite rare to be able to do everything in closed form!
Inversion Method—Example
Consider the probability density p(x) := (π/2) sin(πx) on x ∈ [0,1]

Step I: integrate PDF to get the CDF

  P(x) = ∫_0^x (π/2) sin(πy) dy = [−(1/2) cos(πy)]_0^x = −(1/2) cos(πx) + 1/2

Step II: invert the CDF

  y = (1/2)(1 − cos(πx))
  ⟺ cos(πx) = 1 − 2y
  ⟺ x = arccos(1 − 2y)/π

Now just plug in random values y ∼ U[0,1]
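Turning the example above into code, a brief sketch: draw y ∼ U[0,1] and push it through the inverse CDF x = arccos(1 − 2y)/π:

import math, random

def sample_sin_density():
    # samples p(x) = (pi/2) sin(pi x) on [0,1] via the inverse CDF derived above
    y = random.random()
    return math.acos(1.0 - 2.0 * y) / math.pi

samples = [sample_sin_density() for _ in range(100000)]
print(sum(samples) / len(samples))   # ~0.5, since the density is symmetric about 1/2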
Inversion Method—2D
Can extend inversion method to multiple dimensions, via the multivariable
CDF. E.g., to sample from a two-dimensional region Ω ⊂ ℝ²:

Step I: define a map to the region
  ϕ : [0,1]² → Ω

Step II: measure the change in density
  ρ(u, v) := |det Jϕ(u, v)|

Step III: define 2D cumulative distributions
  F(s) := ∫_0^s ∫_0^1 ρ(u, v) du dv / ∫_0^1 ∫_0^1 ρ(u, v) du dv
    (what fraction of target area is covered by 2D rectangle ϕ([0,s] × [0,1])?)
  G_s(t) := ∫_0^t ρ(s, v) dv / ∫_0^1 ρ(s, v) dv
    (in “column” s, what fraction is covered by 1D interval [0,t]?)

Step IV: invert the distributions
  f(u) := F⁻¹(u)
  g(u, v) := G_u⁻¹(v)

Then “warp” uniform samples: ϕ(f(u), g(u, v)), u, v ∼ U[0,1]

see: Arvo, “Stratified Sampling of 2-Manifolds” (2001)
Inversion Method—2D Disk
Example. Consider the 2D unit disk Ω := {x ∈ ℝ² : ∥x∥ ≤ 1}.

Step I. Define map from unit square (u, v) ∈ [0,1]² to unit disk:
  ϕ(u, v) = v(cos(2πu), sin(2πu))

Step II. Calculate change in density due to “squashing”:
  Jϕ = [ −2πv sin(2πu)   cos(2πu) ]
       [  2πv cos(2πu)   sin(2πu) ]
  ρ(u, v) = |det Jϕ| = 2πv

Step III. Integrate CDFs:
  F(s) = s
  G_s(t) = t²

Step IV. Invert CDFs:
  f(u) = u
  g(u, v) = √v

Step V. Compose with mapping:
  ϕ(f(u), g(u, v)) = √v (cos(2πu), sin(2πu))

  [figure: points under the naive warp ϕ(u, v) vs. the corrected warp
  ϕ(f(u), g(u, v))]
Inversion Method—2D Disk
  [figure: nonuniform warping ϕ, uniform warping ϕ∘η, reparameterization η]
Q: Does this story remind you of anything?
A: It’s what we ideally want to do with importance sampling! Maybe gives a
sense of why that’s hard to do…
Rejection Sampling
Rejection Sampling—Overview
• Inversion method is great when it works…
– …can’t always invert & integrate a distribution!
• In general, will need more universal techniques
• Rejection sampling
– stupid but simple
– essentially just (repeatedly) “guess and check”
– example of Las Vegas algorithm: always yields
correct result, but run time is nondeterministic
Rejection Sampling—Region
• Suppose we want to rejection sample a region Ω ⊂ ℝ²
  – only requirement: must be able to answer the query “is a point x inside Ω?”
• Let A ⊃ Ω be a region that we know how to (efficiently) sample uniformly
  – e.g., a rectangle bounding Ω
• Until we find a point x ∈ Ω:
  – draw a point x ∼ U_A
Example—Unit Disk
• Suppose our region is the unit disk Ω := {x ∈ ℝ² : ∥x∥ ≤ 1}
  [figure: Ω inscribed in the bounding box A = [−1,1]², with corner (1,1)]
• Q: If I have a function rand() that gives uniform values in [0,1], how do
  I uniformly sample the (smallest) box around Ω?
• A: Evaluate u,v = [2*rand()-1, 2*rand()-1]
• Q: How do I check if this point is in the region Ω?
• A: Check if u*u + v*v <= 1
Rejection Sampling—Distribution
• Rather than a region Ω, suppose our goal is to draw samples from a given
  probability distribution p : Ω → [0,1]
  – must be able to sample Ω uniformly
  – must have upper bound pmax on p
• Until we pass the test:
  – sample a point x ∼ U_Ω
  – sample a value u ∼ U[0,pmax]
  – test whether u < p(x)
• “Taller” regions are more likely to be sampled
• Basically just applying “region rejection sampling” to the region under
  the graph!
  [figure: proposed samples (x, u) under the graph of p are accepted; those
  above it are rejected]
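A sketch of this loop for the same density p(x) = (π/2) sin(πx) used in the inversion example, whose maximum on [0,1] is pmax = π/2 (function names are mine):

import math, random

def p(x):
    return 0.5 * math.pi * math.sin(math.pi * x)   # target density on [0,1]

p_max = 0.5 * math.pi                              # upper bound on p

def rejection_sample():
    while True:
        x = random.random()                        # uniform point in the domain
        u = random.uniform(0.0, p_max)             # uniform height
        if u < p(x):                               # accept points under the graph
            return x

print(sum(rejection_sample() for _ in range(100000)) / 100000)   # ~0.5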
Rejection Sampling—Examples
For which of these examples will rejection sampling be least efficient?
  [figure: three regions Ω inside bounding regions A]
For which of these examples will rejection sampling be most efficient?
  [figure: two densities p(x) on [0,1] with upper bound pmax]
Analysis of Rejection Sampling
• How many trials should we expect rejection sampling to take on average?
  – each trial costs us more PRNG invocations
• Q: What’s probability p1 that we accept the 1st sample?
• A: p1 = |Ω| / |A|
• Q: What’s the probability we accept the 2nd sample?
• A: p2 = (1 − p1)p1
• Q: What’s the probability we accept the kth sample?
• A: pk = (1 − p1)^{k−1} p1
• Q: What’s the expected number of trials?
• A: ∑_{k=1}^∞ k pk = ∑_{k=1}^∞ k(1 − p1)^{k−1} p1 = 1/p1
  – e.g., for the unit disk: 1/(π/4) ≈ 1.27
Disk—Rejection Sampling vs. Inversion
• Earlier, saw how to sample from the disk using inversion
• Which approach is better: rejection or inversion?
• Rejection
  – evaluate x² + y² multiple times (~1.27 on average)
  – call PRNG multiple times (~2 × 1.27 = 2.54 on average)
• Inversion
  – evaluate square root, cosine, and sine once
  – call PRNG twice (radius and angle)
• Winner: …depends on a lot of factors!
  – speed of PRNG, cost of special functions, multithreading…
• Universal rule: profile, profile, profile!
  – i.e., which gives best results for actual application?
  [figure: points generated by inversion vs. rejection]
Rejection Sampling the n-Dimensional Ball
• Still, overall, rejection sampling is looking pretty attractive for 2D disk
• But what if we go up in dimension?
• Fraction of d-dimensional box covered by d-dimensional ball goes up… then
  back down!
• Hence, number of trials goes up dramatically
  [figure: ball-to-box volume ratio and expected number of trials as a
  function of dimension d]
• In general there is no silver bullet! Must think about your problem, pick
  the right tool.
Resampling
Importance Resampling
• Suppose we want samples from a “tough” distribution p(x) but:
  – we don’t know how to invert it analytically
  – we don’t know how to rejection sample it efficiently
• Importance resampling combines discrete & continuous strategies:
  – take samples of “easy” continuous distribution
  – construct discrete distribution from these samples
  – sample from discrete distribution
• PROS: always works, trivial to implement, just as good as sampling from
  p(x) directly (e.g., in terms of variance reduction)
• CONS: adds a bit of bias to any resulting estimator, due to
  discretization error
Importance Resampling
• Suppose we want to generate N samples Xi from a distribution p : Ω → [0,1]
• Let q : Ω → [0,1] be a distribution we can (efficiently) draw samples from
  – ideally, q should approximate p
• Step I. Draw M ≫ N samples Yi from q
• Step II. Compute weights wi = p(Yi)/q(Yi)
• Step III. Draw N samples Xi from the discrete distribution with events Yi
  and normalized weights Wi := wi / ∑j wj.
  [figure: target p(x) and proposal q(x); sample weights; histograms of the
  resampled X versus the proposal samples Y]
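A sketch of Steps I–III above (often called sampling-importance-resampling), using a uniform proposal q on [0,1] to approximate the same p(x) = (π/2) sin(πx); the choice of q and the function names are mine:

import math, random

def p(x): return 0.5 * math.pi * math.sin(math.pi * x)   # "tough" target density
def q(x): return 1.0                                     # "easy" proposal: U[0,1]

def importance_resample(N, M):
    Y = [random.random() for _ in range(M)]       # Step I: draw M >> N proposal samples
    w = [p(y) / q(y) for y in Y]                  # Step II: importance weights
    total = sum(w)
    W = [wi / total for wi in w]                  # normalized weights
    return random.choices(Y, weights=W, k=N)      # Step III: resample the discrete distribution

X = importance_resample(N=1000, M=20000)
print(sum(X) / len(X))   # ~0.5, approximately following p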
Summary
Takeaway—Monte Carlo Integration
If you forget everything else from this class, remember this:

If you ever encounter a really tricky integral—or just want to quickly
prototype something—99% of the time you can just write some “quick and
dirty” code that does:

  rejection sampling + basic Monte Carlo

Pick a point x at random. Check if it’s in the domain. If so, add f(x) to
your average. Repeat.

Everything else is just optimization!
Thanks!

MONTE CARLO METHODS AND APPLICATIONS
Carnegie Mellon University | 21-387 | 15-327/627 | 15-860 | Fall 2023
