An Introduction To Probability Theory - Geiss
Contents

1 Probability spaces
  1.1 Definition of σ-algebras
  1.2 Probability measures
  1.3 Examples of distributions
    1.3.1 Binomial distribution with parameter 0 < p < 1
    1.3.2 Poisson distribution with parameter λ > 0
    1.3.3 Geometric distribution with parameter 0 < p < 1
    1.3.4 Lebesgue measure and uniform distribution
    1.3.5 Gaussian distribution on ℝ with mean m and variance σ² > 0
    1.3.6 Exponential distribution on ℝ with parameter λ > 0
    1.3.7 Poisson's Theorem
  1.4 A set which is not a Borel set
2 Random variables
  2.1 Random variables
  2.2 Measurable maps
  2.3 Independence
3 Integration
  3.1 Definition of the expected value
  3.2 Basic properties of the expected value
  3.3 Connections to the Riemann-integral
  3.4 Change of variables in the expected value
  3.5 Fubini's Theorem
  3.6 Some inequalities
4 Modes of convergence
  4.1 Definitions
  4.2 Some applications
Introduction
The modern period of probability theory is connected with names like S.N.
Bernstein (1880-1968), E. Borel (1871-1956), and A.N. Kolmogorov (1903-1987). In particular, in 1933 A.N. Kolmogorov published his modern approach to Probability Theory, including the notion of a measurable space and a probability space. This lecture starts from this notion, continues with random variables and basic parts of integration theory, and finishes with some first limit theorems.
The lecture is based on a mathematical axiomatic approach and is intended for students of mathematics, but also for other students who need more mathematical background for their further studies. We assume that integration with respect to the Riemann-integral on the real line is known. The approach we follow may seem more difficult at the beginning, but once one has a solid basis, many things become easier and more transparent later. Let us start with an introductory example that leads us to a problem which should motivate our axiomatic approach.
Example. We would like to measure the temperature outside our home.
We can do this by an electronic thermometer which consists of a sensor
outside and a display, including some electronics, inside. The number we get from the system is not correct for several reasons: for instance, the calibration of the thermometer might not be correct, and the quality of the power supply and the inside temperature might have some impact on the electronics. It is impossible to describe all these sources of uncertainty explicitly. Hence
one is using probability. What is the idea?
Let us denote the exact temperature by T and the displayed temperature by S, so that the difference T − S is influenced by the above sources of uncertainty. If we measured simultaneously, using thermometers of the same type, we would get values S_1, S_2, ... with corresponding differences
D_1 := T − S_1,  D_2 := T − S_2,  D_3 := T − S_3, ...
Intuitively one expects that, for large n, the arithmetic mean (1/n) Σ_{i=1}^n D_i(ω) and the expected value E D_1 are close to each other. This leads us to the strong law of large numbers discussed in Section 4.2.
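A minimal simulation sketch of this idea (the Gaussian noise model, the numerical values, and the use of numpy are assumptions made only for illustration): the simulated errors D_i fluctuate, but their running average settles near E D_1 = 0.

    import numpy as np

    rng = np.random.default_rng(0)
    true_temperature = 20.0          # the exact temperature T (assumed value)
    noise_std = 0.5                  # assumed spread of the sensor errors

    # simulated displayed values S_1, ..., S_n and differences D_i = T - S_i
    n = 10_000
    S = true_temperature + rng.normal(0.0, noise_std, size=n)
    D = true_temperature - S

    running_mean = np.cumsum(D) / np.arange(1, n + 1)
    # the running average approaches E D_1 = 0 as n grows
    print(running_mean[9], running_mean[99], running_mean[999], running_mean[-1])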
Notation. Given a set Ω and subsets A, B ⊆ Ω, the following notation is used:

intersection:           A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}
union:                  A ∪ B = {ω ∈ Ω : ω ∈ A or (or both) ω ∈ B}
set-theoretical minus:  A\B = {ω ∈ Ω : ω ∈ A and ω ∉ B}
complement:             A^c = {ω ∈ Ω : ω ∉ A}
empty set:              ∅ = set without any element
real numbers:           ℝ
natural numbers:        ℕ = {1, 2, 3, ...}
rational numbers:       ℚ
minimum of a and b:     a ∧ b := min{a, b}
Chapter 1
Probability spaces
In this chapter we introduce the probability space, the fundamental notion of probability theory. A probability space (Ω, F, P) consists of three components.

(1) The elementary events or states ω, which are collected in a non-empty set Ω.

Example 1.0.1 (a) If we roll a die, then all possible outcomes are the numbers between 1 and 6. That means
Ω = {1, 2, 3, 4, 5, 6}.
(b) If we flip a coin, then we have either heads or tails on top, that means
Ω = {H, T}.
If we have two coins, then we would get
Ω = {(H, H), (H, T), (T, H), (T, T)}.
(c) For the lifetime of a bulb in hours we can choose
Ω = [0, ∞).

(2) A σ-algebra F, which is the system of observable subsets of Ω. Given ω ∈ Ω and some A ∈ F, one cannot say which concrete ω occurs, but one can decide whether ω ∈ A or ω ∉ A. The sets A ∈ F are called events: an event A occurs if ω ∈ A and it does not occur if ω ∉ A.

Example 1.0.2 (a) The event "the die shows an even number" can be described by
A = {2, 4, 6}.
1.1 Definition of σ-algebras

The σ-algebra is a basic tool in probability theory. It is the system of sets on which probability measures are defined. Without this notion it would be impossible to consider the fundamental Lebesgue measure on the interval [0, 1] or to consider Gaussian measures, without which many parts of mathematics cannot live.

Definition 1.1.1 [σ-algebra, algebra, measurable space] Let Ω be a non-empty set. A system F of subsets A ⊆ Ω is called σ-algebra on Ω if
(1) ∅, Ω ∈ F,
(2) A ∈ F implies that A^c := Ω\A ∈ F,
(3) A_1, A_2, ... ∈ F implies that ⋃_{i=1}^∞ A_i ∈ F.
The pair (Ω, F), where F is a σ-algebra on Ω, is called measurable space.

Proposition 1.1.4 Let Ω be a non-empty set and let F_j, j ∈ J, be σ-algebras on Ω, where J is an arbitrary non-empty index set. Then
⋂_{j∈J} F_j
is a σ-algebra as well.
Proof. The proof is very easy, but typical and fundamental. First we notice that ∅, Ω ∈ F_j for all j ∈ J, so that ∅, Ω ∈ ⋂_{j∈J} F_j. Now let A, A_1, A_2, ... ∈ ⋂_{j∈J} F_j. Hence A, A_1, A_2, ... ∈ F_j for all j ∈ J, so that (the F_j are σ-algebras!)
A^c = Ω\A ∈ F_j  and  ⋃_{i=1}^∞ A_i ∈ F_j
for all j ∈ J. Consequently,
A^c ∈ ⋂_{j∈J} F_j  and  ⋃_{i=1}^∞ A_i ∈ ⋂_{j∈J} F_j.

Given a system G of subsets of Ω, let J be the set of all σ-algebras on Ω containing G. Then
σ(G) := ⋂_{C∈J} C
yields a σ-algebra according to Proposition 1.1.4 such that (by construction) G ⊆ σ(G). It remains to show that σ(G) is the smallest σ-algebra containing G. Assume another σ-algebra F with G ⊆ F. By definition of J we have that F ∈ J, so that
σ(G) = ⋂_{C∈J} C ⊆ F.

The construction is very elegant but has, as already mentioned, the slight disadvantage that one cannot explicitly construct all elements of σ(G). Let us now turn to one of the most important examples, the Borel σ-algebra on ℝ. To do this we need the notion of open and closed sets.
Proposition [Generation of the Borel σ-algebra B(ℝ)] We let
G_1 be the system of all open subsets of ℝ,
G_2 be the system of all closed subsets of ℝ,
G_3 be the system of all intervals (−∞, b], b ∈ ℝ,
G_4 be the system of all intervals (−∞, b), b ∈ ℝ,
G_5 be the system of all intervals (a, b], −∞ < a < b < ∞,
G_6 be the system of all intervals (a, b), −∞ < a < b < ∞.
Then σ(G_1) = σ(G_2) = σ(G_3) = σ(G_4) = σ(G_5) = σ(G_6) = B(ℝ).

In the proof one uses relations like
(a, b) = ⋃_{n=1}^∞ [(−∞, b) \ (−∞, a + 1/n)] ∈ σ(G_3)
and the fact that every open set A ⊆ ℝ is a countable union of open intervals: for all x ∈ A there is an ε_x > 0 such that (x − ε_x, x + ε_x) ⊆ A.
1.2 Probability measures

Definition [measure, probability measure] Let (Ω, F) be a measurable space.
(1) A map μ : F → [0, ∞] is called measure if μ(∅) = 0 and, for all A_1, A_2, ... ∈ F with A_i ∩ A_j = ∅ for i ≠ j, one has
μ(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ μ(A_i).   (1.1)
The triplet (Ω, F, μ) is called measure space.
(2) A measure space (Ω, F, μ) or a measure μ is called σ-finite provided that there are Ω_k ∈ F, k = 1, 2, ..., such that Ω = ⋃_{k=1}^∞ Ω_k and μ(Ω_k) < ∞ for all k.
The measure space (Ω, F, μ) or the measure μ are called finite if
μ(Ω) < ∞.
(3) A measure space (Ω, F, μ) is called probability space and μ probability measure provided that μ(Ω) = 1.

Example [Dirac measure] For x_0 ∈ Ω the Dirac measure δ_{x_0} : F → [0, 1] is defined by
δ_{x_0}(A) := 1 if x_0 ∈ A  and  δ_{x_0}(A) := 0 if x_0 ∉ A.
P(A_k) := [n!/(k!(n − k)!)] p^k (1 − p)^{n−k},  0 < p < 1.
Proposition [Properties of probability measures] Let (Ω, F, P) be a probability space. Then the following holds:
(1) P(∅) = 0.
(2) If A_1, ..., A_n ∈ F are pairwise disjoint, then P(⋃_{i=1}^n A_i) = Σ_{i=1}^n P(A_i).
(3) If A, B ∈ F, then P(A\B) = P(A) − P(A ∩ B).
(4) If B ∈ F, then P(B^c) = 1 − P(B).
(5) If A_1, A_2, ... ∈ F, then P(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).
(6) If A_1, A_2, ... ∈ F with A_1 ⊆ A_2 ⊆ A_3 ⊆ ..., then
lim_{n→∞} P(A_n) = P(⋃_{n=1}^∞ A_n).
(7) If A_1, A_2, ... ∈ F with A_1 ⊇ A_2 ⊇ A_3 ⊇ ..., then
lim_{n→∞} P(A_n) = P(⋂_{n=1}^∞ A_n).
Proof. (1) and (2): Apply σ-additivity (1.1) to the sequence A_1, ..., A_n, ∅, ∅, ...: taking all sets equal to ∅ gives
P(∅) = P(⋃_{n=1}^∞ ∅) = Σ_{n=1}^∞ P(∅),
hence P(∅) = 0, and then
P(⋃_{i=1}^n A_i) = P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) = Σ_{i=1}^n P(A_i),
because of P(∅) = 0.
(3) Since (A ∩ B) ∩ (A\B) = ∅ and (A ∩ B) ∪ (A\B) = A, we get from (2) that
P(A ∩ B) + P(A\B) = P(A).
(4) follows from (2) applied to B and B^c, since P(B) + P(B^c) = P(Ω) = 1.
(5) Put B_1 := A_1 and B_i := A_i \ (A_1 ∪ ... ∪ A_{i−1}), so that the B_i are pairwise disjoint, B_i ⊆ A_i, and ⋃_{i=1}^∞ B_i = ⋃_{i=1}^∞ A_i. Hence
P(⋃_{i=1}^∞ A_i) = P(⋃_{i=1}^∞ B_i) = Σ_{i=1}^∞ P(B_i) ≤ Σ_{i=1}^∞ P(A_i).
(6) Put B_1 := A_1 and B_n := A_n \ A_{n−1} for n ≥ 2, so that
⋃_{n=1}^∞ B_n = ⋃_{n=1}^∞ A_n  and  B_i ∩ B_j = ∅ for i ≠ j.
Consequently,
P(⋃_{n=1}^∞ A_n) = P(⋃_{n=1}^∞ B_n) = Σ_{n=1}^∞ P(B_n) = lim_{N→∞} Σ_{n=1}^N P(B_n) = lim_{N→∞} P(A_N),
since ⋃_{n=1}^N B_n = A_N. (7) is an exercise.
Definition [lim inf_n A_n and lim sup_n A_n] For A_1, A_2, ... ∈ F we let
lim inf_n A_n := ⋃_{n=1}^∞ ⋂_{k=n}^∞ A_k  and  lim sup_n A_n := ⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k.

The definition above says that ω ∈ lim inf_n A_n if and only if all events A_n, except a finite number of them, occur, and that ω ∈ lim sup_n A_n if and only if infinitely many of the events A_n occur.
Definition 1.2.6 [lim inf_n ξ_n and lim sup_n ξ_n] For ξ_1, ξ_2, ... ∈ ℝ we let
lim inf_n ξ_n := lim_n inf_{k≥n} ξ_k  and  lim sup_n ξ_n := lim_n sup_{k≥n} ξ_k.

Remark 1.2.7 (1) The value lim inf_n ξ_n is the infimum of all c such that there is a subsequence n_1 < n_2 < n_3 < ... such that lim_k ξ_{n_k} = c.
(2) The value lim sup_n ξ_n is the supremum of all c such that there is a subsequence n_1 < n_2 < n_3 < ... such that lim_k ξ_{n_k} = c.
(3) By definition one has that
lim inf_n ξ_n ≤ lim sup_n ξ_n.
(4) For events A_1, A_2, ... ∈ F one has ω ∈ lim sup_n A_n if and only if lim sup_n 1I_{A_n}(ω) = 1, and in particular
lim inf_n A_n ⊆ lim sup_n A_n.
For two events the independence of A and B means
P(A ∩ B) = P(A)P(B),
while for three events one also requires
P(A ∩ B ∩ C) = P(A)P(B)P(C);
note that choosing C = Ω recovers the condition for A and B. Moreover, for B ∈ F with 0 < P(B) < 1,
P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A|B)P(B) + P(A|B^c)P(B^c).
This implies
P(B|A) = P(B ∩ A)/P(A) = P(A|B)P(B)/P(A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)].
Let us consider an
Example 1.2.11 A laboratory blood test is 95% effective in detecting a
certain disease when it is, in fact, present. However, the test also yields a
false positive result for 1% of the healthy persons tested. If 0.5% of the
population actually has the disease, what is the probability a person has the
disease given his test result is positive? We set
B := person has the disease,
A := the test result is positive.
Hence we have
P(A|B) = 0.95,  P(A|B^c) = 0.01,  P(B) = 0.005.

More generally, if B_1, ..., B_n ∈ F form a partition of Ω with P(B_k) > 0 for all k, then for every A ∈ F with P(A) > 0 one has Bayes' formula
P(B_j|A) = P(A|B_j)P(B_j) / [Σ_{k=1}^n P(A|B_k)P(B_k)].
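Plugging these numbers into Bayes' formula answers Example 1.2.11; the arithmetic below is a worked computation added here for illustration:

P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)]
       = (0.95 · 0.005) / (0.95 · 0.005 + 0.01 · 0.995)
       = 0.00475 / 0.0147 ≈ 0.323.

So even with a positive test the probability of actually having the disease is only about 32%, because the disease is rare in the population.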
Proposition [Lemma of Borel-Cantelli] Let (Ω, F, P) be a probability space and A_1, A_2, ... ∈ F. Then one has the following:
(1) If Σ_{n=1}^∞ P(A_n) < ∞, then P(lim sup_{n→∞} A_n) = 0.
(2) If A_1, A_2, ... are independent and Σ_{n=1}^∞ P(A_n) = ∞, then P(lim sup_{n→∞} A_n) = 1.

Proof. (1) Since lim sup_n A_n = ⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k and ⋃_{k=n}^∞ A_k ⊇ ⋃_{k=n+1}^∞ A_k, the continuity of P from above gives
P(lim sup_n A_n) = lim_{n→∞} P(⋃_{k=n}^∞ A_k) ≤ lim_{n→∞} Σ_{k=n}^∞ P(A_k) = 0,
because Σ_{n=1}^∞ P(A_n) < ∞.
(2) It holds
(lim sup_n A_n)^c = lim inf_n A_n^c = ⋃_{n=1}^∞ ⋂_{k=n}^∞ A_k^c,
so that it suffices to show P(⋂_{k=n}^∞ A_k^c) = 0 for all n. Letting B_n := ⋂_{k=n}^∞ A_k^c we get, again by continuity from above,
P(B_n) = lim_{N→∞} P(⋂_{k=n}^N A_k^c).
Since the independence of A_1, A_2, ... implies the independence of A_1^c, A_2^c, ..., we finally get (setting p_n := P(A_n)) that
P(⋂_{k=n}^N A_k^c) = Π_{k=n}^N P(A_k^c) = Π_{k=n}^N (1 − p_k) ≤ Π_{k=n}^N e^{−p_k} = e^{−Σ_{k=n}^N p_k} → 0 as N → ∞,
where we have used that 1 − x ≤ e^{−x} for x ≥ 0 and that Σ_{k=n}^∞ p_k = ∞.
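Two quick worked illustrations of the lemma (standard examples, added here for orientation):

If the A_n are independent with P(A_n) = 1/n, then Σ_{n=1}^∞ P(A_n) = ∞, so P(lim sup_n A_n) = 1: almost surely infinitely many of the A_n occur.
If instead P(A_n) = 1/n², then Σ_{n=1}^∞ P(A_n) = π²/6 < ∞, so P(lim sup_n A_n) = 0, and this conclusion needs no independence assumption at all.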
Although the definition of a measure is not difficult, proving existence and uniqueness of measures may sometimes be difficult. The problem lies in the fact that, in general, the σ-algebras are not constructed explicitly, one only knows of their existence. To overcome this difficulty, one usually exploits

Proposition 1.2.14 [Carathéodory's extension theorem] Let Ω be a non-empty set and G an algebra on Ω with F := σ(G). Assume that P_0 : G → [0, 1] satisfies:
(1) P_0(Ω) = 1.
(2) If A_1, A_2, ... ∈ G, A_i ∩ A_j = ∅ for i ≠ j, and ⋃_{i=1}^∞ A_i ∈ G, then
P_0(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P_0(A_i).
Then there exists a unique probability measure P on F such that
P(A) = P_0(A)  for all A ∈ G.
As an application one obtains the product of two probability spaces (Ω_1, F_1, P_1) and (Ω_2, F_2, P_2): on the algebra G of finite disjoint unions of rectangles A_1 × A_2 with A_1 ∈ F_1, A_2 ∈ F_2 one sets
P_0((A_1^1 × A_2^1) ∪ ... ∪ (A_1^n × A_2^n)) := Σ_{k=1}^n P_1(A_1^k) P_2(A_2^k)
and extends P_0 by Proposition 1.2.14 to the product measure P_1 × P_2 on σ(G).

For the uniqueness one uses π-systems: let (Ω, F) be a measurable space and G ⊆ F with σ(G) = F and A ∩ B ∈ G for all A, B ∈ G. If two probability measures P_1 and P_2 on F satisfy
P_1(A) = P_2(A)  for all A ∈ G,
then P_1(A) = P_2(A) for all A ∈ F.
1.3 Examples of distributions

1.3.1 Binomial distribution with parameter 0 < p < 1

P(B) = μ_{n,p}(B) := Σ_{k=0}^n [n!/(k!(n − k)!)] p^k (1 − p)^{n−k} δ_k(B),
where δ_k denotes the Dirac measure in k.

Interpretation: Coin-tossing with one coin, such that one has heads with probability p and tails with probability 1 − p. Then μ_{n,p}({k}) equals the probability that within n trials one has heads exactly k times.
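A small worked instance of the formula (the numerical choice n = 3, p = 1/2 is made only for illustration):

μ_{3,1/2}({2}) = [3!/(2!·1!)] (1/2)² (1/2)¹ = 3/8,

the probability of seeing heads exactly twice in three fair coin tosses.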
1.3.2 Poisson distribution with parameter λ > 0

P(B) = π_λ(B) := Σ_{k=0}^∞ e^{−λ} (λ^k/k!) δ_k(B).

The Poisson distribution is used, for example, to model jump-diffusion processes: the probability that one has k jumps between the time-points s and t with 0 ≤ s < t < ∞ is equal to π_{λ(t−s)}({k}).
1.3.3 Geometric distribution with parameter 0 < p < 1

P(B) = μ_p(B) := Σ_{k=0}^∞ (1 − p)^k p δ_k(B).
1.3.4 Lebesgue measure and uniform distribution

The Lebesgue measure λ on B(ℝ) is determined by its values on finite unions of disjoint intervals,
λ((a_1, b_1] ∪ ... ∪ (a_n, b_n]) := Σ_{i=1}^n (b_i − a_i),
and obtained from these values by Carathéodory's extension theorem. Letting
P(B) := (1/(b − a)) λ(B)  for B ∈ B((a, b])
one obtains the uniform distribution on (a, b].
1.3.5 Gaussian distribution on ℝ with mean m and variance σ² > 0

(1) Ω := ℝ.
(2) F := B(ℝ) (Borel σ-algebra).
(3) We take the algebra G considered in Example 1.1.3 and define
P_0(A) := Σ_{i=1}^n ∫_{a_i}^{b_i} (1/√(2πσ²)) e^{−(x−m)²/(2σ²)} dx
for A := (a_1, b_1] ∪ (a_2, b_2] ∪ ... ∪ (a_n, b_n], where we consider the Riemann-integral on the right-hand side. One can show (we do not do this here, but compare with Proposition 3.5.8 below) that P_0 satisfies the assumptions of Proposition 1.2.14, so that we can extend P_0 to a probability measure N_{m,σ²} on B(ℝ).

The measure N_{m,σ²} is called Gaussian distribution (normal distribution) with mean m and variance σ². Given A ∈ B(ℝ) we write
N_{m,σ²}(A) = ∫_A p_{m,σ²}(x) dx  with  p_{m,σ²}(x) := (1/√(2πσ²)) e^{−(x−m)²/(2σ²)}.
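A small numerical sketch (numpy and the parameter values are assumptions made for this check only): approximating the integrals by Riemann sums on a wide grid confirms that p_{m,σ²} has total mass one, mean m, and variance σ².

    import numpy as np

    m, sigma2 = 1.0, 4.0                      # example parameters (assumed)
    x = np.linspace(m - 50.0, m + 50.0, 200_001)
    dx = x[1] - x[0]
    p = np.exp(-(x - m) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    print(np.sum(p) * dx)                 # ~ 1.0      (total mass)
    print(np.sum(x * p) * dx)             # ~ m        (mean)
    print(np.sum((x - m) ** 2 * p) * dx)  # ~ sigma2   (variance)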
1.3.6 Exponential distribution on ℝ with parameter λ > 0

(1) Ω := ℝ.
(2) F := B(ℝ) (Borel σ-algebra).
(3) As before,
P_0(A) := Σ_{i=1}^n ∫_{a_i}^{b_i} p_λ(x) dx  for A := (a_1, b_1] ∪ ... ∪ (a_n, b_n],
where p_λ(x) := 1I_{[0,∞)}(x) λ e^{−λx}, and P_0 extends to the exponential distribution μ_λ with parameter λ, so that
μ_λ(A) = ∫_A p_λ(x) dx.

The exponential distribution is memoryless: for a, b ≥ 0,
μ_λ([a + b, ∞) ∩ [a, ∞)) / μ_λ([a, ∞)) = (∫_{a+b}^∞ λ e^{−λx} dx) / (∫_a^∞ λ e^{−λx} dx) = e^{−λ(a+b)} / e^{−λa} = e^{−λb} = μ_λ([b, ∞)).
Example 1.3.2 Suppose that the amount of time one spends in a post office is exponentially distributed with λ = 1/10.
(a) What is the probability that a customer will spend more than 15 minutes in the post office?
(b) What is the probability that a customer will spend more than 15 minutes in the post office, given that she or he has already been there for at least 10 minutes?
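A short worked solution, added for illustration and using the memorylessness computed above (time spent is denoted by f):

(a) P(f > 15) = e^{−15/10} = e^{−1.5} ≈ 0.223.
(b) P(f > 15 | f > 10) = P(f > 5) = e^{−5/10} = e^{−0.5} ≈ 0.607.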
1.3.7 Poisson's Theorem

For large n and small p the Poisson distribution provides a good approximation for the binomial distribution.

Proposition 1.3.3 [Poisson's Theorem] Let λ > 0, p_n ∈ (0, 1), n = 1, 2, ..., and assume that np_n → λ as n → ∞. Then, for all k = 0, 1, ...,
μ_{n,p_n}({k}) → π_λ({k}),  n → ∞.
Proof. Fix an integer k ≥ 0. Then
μ_{n,p_n}({k}) = [n!/(k!(n − k)!)] p_n^k (1 − p_n)^{n−k}
             = [n(n − 1)···(n − k + 1)/k!] p_n^k (1 − p_n)^{n−k}
             = (1/k!) [n(n − 1)···(n − k + 1)/n^k] (np_n)^k (1 − p_n)^{n−k}.
Since np_n → λ, the factor (np_n)^k converges to λ^k and n(n − 1)···(n − k + 1)/n^k converges to 1. For the remaining factor write np_n = λ + ε_n with ε_n → 0, so that
(1 − p_n)^{n−k} = (1 − (λ + ε_n)/n)^{n−k}.
By the rule of l'Hospital,
lim_{n→∞} (n − k) ln(1 − (λ + ε_n)/n) = lim_{n→∞} [ln(1 − (λ + ε_n)/n)] / [1/(n − k)] = −λ,
and hence
lim_{n→∞} (1 − p_n)^{n−k} = lim_{n→∞} (1 − (λ + ε_n)/n)^{n−k} = e^{−λ}.
Altogether, μ_{n,p_n}({k}) → (λ^k/k!) e^{−λ} = π_λ({k}).
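A numerical sketch of Poisson's Theorem (the value λ = 2 and the use of the standard library are choices made for this illustration): the binomial probabilities μ_{n, λ/n}({k}) approach π_λ({k}) as n grows.

    from math import comb, exp, factorial

    lam = 2.0
    poisson = [exp(-lam) * lam**k / factorial(k) for k in range(5)]
    print("poisson ", [round(q, 4) for q in poisson])

    for n in (10, 100, 1000):
        p = lam / n
        binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5)]
        print("n =", n, [round(b, 4) for b in binom])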
1.4 A set which is not a Borel set
In this section we shall construct a set which is a subset of (0, 1] but not an
element of
B((0, 1]) := {B = A ∩ (0, 1] : A ∈ B(ℝ)}.
Before we start we need
Definition 1.4.1 [λ-system] A class L is a λ-system if
(1) Ω ∈ L,
(2) A, B ∈ L and A ⊆ B imply B\A ∈ L,
(3) A_1, A_2, ... ∈ L and A_n ⊆ A_{n+1}, n = 1, 2, ..., imply ⋃_{n=1}^∞ A_n ∈ L.

For x, y ∈ (0, 1] we define the addition modulo one,
x ⊕ y := x + y if x + y ∈ (0, 1]  and  x ⊕ y := x + y − 1 otherwise,
and
A ⊕ x := {a ⊕ x : a ∈ A}.
Now define
L := {A ∈ B((0, 1]) such that A ⊕ x ∈ B((0, 1]) and λ(A ⊕ x) = λ(A) for all x ∈ (0, 1]}.
One checks that L is a λ-system: for example, if A, B ∈ L with A ⊆ B, then for all x ∈ (0, 1] one has (B ⊕ x) \ (A ⊕ x) = (B\A) ⊕ x ∈ B((0, 1]) and
λ((B\A) ⊕ x) = λ((B ⊕ x) \ (A ⊕ x)) = λ(B ⊕ x) − λ(A ⊕ x) = λ(B) − λ(A) = λ(B\A).

Now define an equivalence relation on (0, 1] by
x ∼ y  if and only if  x ⊕ r = y for some rational r ∈ (0, 1],
and let H ⊆ (0, 1] contain exactly one representative of every equivalence class (such a set exists by the axiom of choice). The sets H ⊕ r, r ∈ (0, 1] rational, are pairwise disjoint and
(0, 1] = ⋃_{r ∈ (0,1] rational} (H ⊕ r).
If H were a Borel set, then a := λ(H ⊕ r) = λ(H) would not depend on r and
1 = λ((0, 1]) = Σ_{r ∈ (0,1] rational} λ(H ⊕ r) = Σ_{r ∈ (0,1] rational} a.
So the right hand side can either be 0 (if a = 0) or ∞ (if a > 0). This leads to a contradiction, so H ∉ B((0, 1]).
Chapter 2
Random variables
Given a probability space (Ω, F, P), in many stochastic models one considers functions f : Ω → ℝ, which describe certain random phenomena, and one is interested in the computation of expressions like
P({ω ∈ Ω : f(ω) ∈ (a, b)}),  where a < b.
2.1 Random variables

Definition [measurable step-function] Let (Ω, F) be a measurable space. A function f : Ω → ℝ is called measurable step-function provided that there are α_1, ..., α_n ∈ ℝ and A_1, ..., A_n ∈ F such that
f(ω) = Σ_{i=1}^n α_i 1I_{A_i}(ω),
where
1I_{A_i}(ω) := 1 if ω ∈ A_i  and  1I_{A_i}(ω) := 0 if ω ∉ A_i.

Some useful rules for indicator functions are
1I_Ω = 1,  1I_∅ = 0,  1I_A + 1I_{A^c} = 1,  1I_{A∩B} = 1I_A 1I_B,  1I_{A∪B} = 1I_A + 1I_B − 1I_{A∩B}.

The definition above concerns only functions which take finitely many values, which will be too restrictive in the future. So we wish to extend this definition.

Definition 2.1.2 [random variables] Let (Ω, F) be a measurable space. A map f : Ω → ℝ is called random variable provided that there is a sequence (f_n)_{n=1}^∞ of measurable step-functions f_n : Ω → ℝ such that
f(ω) = lim_{n→∞} f_n(ω)  for all ω ∈ Ω.
Does our definition give what we would like to have? Yes, as we see from
Proposition 2.1.3 Let (Ω, F) be a measurable space and let f : Ω → ℝ be a function. Then the following conditions are equivalent:
(1) f is a random variable, that means the pointwise limit of measurable step-functions;
(2) f^{−1}((a, b)) ∈ F for all −∞ < a < b < ∞.

Proof. (1) ⟹ (2): Assume that
f(ω) = lim_{n→∞} f_n(ω),
where f_n : Ω → ℝ are measurable step-functions. For a measurable step-function one has that f_n^{−1}((a, b)) ∈ F, so that
f^{−1}((a, b)) = {ω : a < f(ω) < b} = ⋃_{m=1}^∞ ⋃_{N=1}^∞ ⋂_{n=N}^∞ {ω : a + 1/m < f_n(ω) < b − 1/m} ∈ F.
(2) ⟹ (1): One checks that the measurable step-functions
f_n(ω) := Σ_{k=−4^n}^{4^n − 1} (k/2^n) 1I_{{k/2^n ≤ f < (k+1)/2^n}}(ω)
converge to f(ω) for all ω ∈ Ω as n → ∞.
In particular, if f and g are random variables with g(ω) ≠ 0 for all ω ∈ Ω, then
(f/g)(ω) := f(ω)/g(ω)
is a random variable. For products one argues as follows: if f(ω) = lim_n f_n(ω) and g(ω) = lim_n g_n(ω) with measurable step-functions f_n, g_n, then
(fg)(ω) = lim_{n→∞} f_n(ω) g_n(ω),
and writing
f_n(ω) = Σ_{i=1}^k α_i 1I_{A_i}(ω)  and  g_n(ω) = Σ_{j=1}^l β_j 1I_{B_j}(ω)
yields
(f_n g_n)(ω) = Σ_{i=1}^k Σ_{j=1}^l α_i β_j 1I_{A_i}(ω) 1I_{B_j}(ω) = Σ_{i=1}^k Σ_{j=1}^l α_i β_j 1I_{A_i ∩ B_j}(ω),
which is again a measurable step-function; hence fg is a random variable.
2.2 Measurable maps

Definition [measurable map] Let (Ω, F) and (M, Σ) be measurable spaces. A map f : Ω → M is called (F, Σ)-measurable provided that
f^{−1}(B) = {ω ∈ Ω : f(ω) ∈ B} ∈ F  for all B ∈ Σ.

Lemma 2.2.3 Let (Ω, F) and (M, Σ) be measurable spaces and let Σ_0 ⊆ Σ be a system of sets with σ(Σ_0) = Σ. If
f^{−1}(B) ∈ F  for all B ∈ Σ_0,
then
f^{−1}(B) ∈ F  for all B ∈ Σ.

Proof. Define
A := {B ⊆ M : f^{−1}(B) ∈ F}.
Obviously, Σ_0 ⊆ A. We show that A is a σ-algebra.
(1) f^{−1}(M) = Ω ∈ F implies that M ∈ A.
(2) If B ∈ A, then
f^{−1}(B^c) = {ω : f(ω) ∈ B^c} = {ω : f(ω) ∉ B} = Ω \ {ω : f(ω) ∈ B} = f^{−1}(B)^c ∈ F.
(3) If B_1, B_2, ... ∈ A, then
f^{−1}(⋃_{i=1}^∞ B_i) = ⋃_{i=1}^∞ f^{−1}(B_i) ∈ F.
Hence Σ = σ(Σ_0) ⊆ A, and we are done.

Proposition Every continuous function f : ℝ → ℝ is (B(ℝ), B(ℝ))-measurable.

Proof. Since f is continuous we know that f^{−1}((a, b)) is open for all −∞ < a < b < ∞, so that f^{−1}((a, b)) ∈ B(ℝ). Since the open intervals generate B(ℝ) we can apply Lemma 2.2.3.
Now we state some general properties of measurable maps.
Proposition 2.2.5 Let (Ω_1, F_1), (Ω_2, F_2), (Ω_3, F_3) be measurable spaces. Assume that f : Ω_1 → Ω_2 is (F_1, F_2)-measurable and that g : Ω_2 → Ω_3 is (F_2, F_3)-measurable. Then the following is satisfied:
(1) g ∘ f : Ω_1 → Ω_3 defined by
(g ∘ f)(ω_1) := g(f(ω_1))
is (F_1, F_3)-measurable.
(2) Assume that
Assume the random number generator gives out the number x. If we wrote a program such that output = "heads" in case x ∈ [0, p) and output = "tails" in case x ∈ [p, 1], the output would simulate the flipping of an (unfair) coin, or in other words, output has binomial distribution μ_{1,p}.
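A tiny sketch of such a program (Python; the standard library's uniform generator stands in for the random number generator described above, and the value of p is an assumed example):

    import random

    p = 0.3                       # probability of heads (assumed example value)

    def flip() -> str:
        x = random.random()       # x is (approximately) uniform on [0, 1)
        return "heads" if x < p else "tails"

    flips = [flip() for _ in range(10_000)]
    # the relative frequency of "heads" is close to p
    print(flips[:10], flips.count("heads") / len(flips))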
Definition 2.2.7 [law of a random variable] Let (Ω, F, P) be a probability space and f : Ω → ℝ be a random variable. Then
P_f(B) := P({ω ∈ Ω : f(ω) ∈ B})
is called the law of the random variable f.
and lim_{x→∞} F(x) = 1, where F(x) := P({ω : f(ω) ≤ x}) denotes the distribution-function of f.
(ii) F is right-continuous: for x_n ↓ x one has
F(x) = P({ω : f(ω) ≤ x}) = P(⋂_{n=1}^∞ {ω : f(ω) ≤ x_n}) = lim_{n→∞} P({ω : f(ω) ≤ x_n}) = lim_{n→∞} F(x_n).
(iii) The properties lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1 are an exercise.
Proof. (1) ⟹ (2) is of course trivial. We consider (2) ⟹ (1): For sets of type
A := (a_1, b_1] × ... × (a_n, b_n],
where the intervals are disjoint, one can check the asserted equality directly; a generating-class argument then gives the general case.

The notions introduced so far can be summarized as follows: for a map f : Ω → ℝ,
f is measurable (Lemma 2.2.3): f^{−1}(A) ∈ F for all A ∈ B(ℝ),
is equivalent to
f is a random variable (Proposition 2.2.2): there exist measurable step-functions (f_n)_{n=1}^∞, i.e.
f_n = Σ_{k=1}^{N_n} a_k^n 1I_{A_k^n}
with a_k^n ∈ ℝ and A_k^n ∈ F, such that f_n(ω) → f(ω) for all ω ∈ Ω as n → ∞.
2.3 Independence

Let us first start with the notion of a family of independent random variables.

Definition 2.3.1 [independence of a family of random variables] Let (Ω, F, P) be a probability space and f_i : Ω → ℝ, i ∈ I, be random variables where I is a non-empty index-set. The family (f_i)_{i∈I} is called independent provided that for all distinct i_1, ..., i_n ∈ I, n = 1, 2, ..., and all B_1, ..., B_n ∈ B(ℝ) one has that
P(f_{i_1} ∈ B_1, ..., f_{i_n} ∈ B_n) = P(f_{i_1} ∈ B_1) ··· P(f_{i_n} ∈ B_n).

In case we have a finite index set I, that means for example I = {1, ..., n}, the definition above is equivalent to

Definition 2.3.2 [independence of a finite family of random variables] Let (Ω, F, P) be a probability space and f_i : Ω → ℝ, i = 1, ..., n, random variables. The random variables f_1, ..., f_n are called independent provided that for all B_1, ..., B_n ∈ B(ℝ) one has that
P(f_1 ∈ B_1, ..., f_n ∈ B_n) = P(f_1 ∈ B_1) ··· P(f_n ∈ B_n).
Assume that there are functions p_f, p_g : ℝ → [0, ∞) with
∫_ℝ p_f(x) dx = ∫_ℝ p_g(x) dx = 1,
F_f(x) = ∫_{−∞}^x p_f(y) dy  and  F_g(x) = ∫_{−∞}^x p_g(y) dy
for all x ∈ ℝ (one says that the distribution-functions F_f and F_g are absolutely continuous with densities p_f and p_g, respectively). Then the independence of f and g is also equivalent to
F_{(f,g)}(x, y) = ∫_{−∞}^x ∫_{−∞}^y p_f(u) p_g(v) dλ(v) dλ(u)  for all x, y ∈ ℝ.

Proposition 2.3.8 [Realization of independent random variables] Let (ℝ^ℕ, B(ℝ^ℕ), P) and the coordinate maps π_n : ℝ^ℕ → ℝ be defined as above. Then (π_n)_{n=1}^∞ is a sequence of independent random variables such that the law of π_n is P_n, that means
P(π_n ∈ B) = P_n(B)  for all B ∈ B(ℝ).
Chapter 3
Integration
Given a probability space (Ω, F, P) and a random variable f : Ω → ℝ, we define the expectation or integral
E f = ∫_Ω f dP = ∫_Ω f(ω) dP(ω)
and investigate its basic properties.

3.1 Definition of the expected value

The definition is done within three steps. First, given a measurable step-function
g = Σ_{i=1}^n α_i 1I_{A_i},
we let
E g = ∫_Ω g dP = ∫_Ω g(ω) dP(ω) := Σ_{i=1}^n α_i P(A_i).
We have to check that the definition is correct, since it might be that different representations give different expected values E g. However, this is not the case, as shown by

Lemma 3.1.2 Assume measurable step-functions
g = Σ_{i=1}^n α_i 1I_{A_i} = Σ_{j=1}^m β_j 1I_{B_j}.
Then
Σ_{i=1}^n α_i P(A_i) = Σ_{j=1}^m β_j P(B_j).
Proof. By subtracting in both equations the right-hand side from the left-hand one we only need to show that
Σ_{i=1}^n α_i 1I_{A_i} = 0
implies that
Σ_{i=1}^n α_i P(A_i) = 0.
By taking all possible intersections of the sets A_i and by adding appropriate complements we find a system of pairwise disjoint sets C_1, ..., C_N ∈ F with
⋃_{j=1}^N C_j = Ω
such that every A_i can be written as A_i = ⋃_{j∈I_i} C_j for some index set I_i ⊆ {1, ..., N}. Now we get that
0 = Σ_{i=1}^n α_i 1I_{A_i} = Σ_{i=1}^n Σ_{j∈I_i} α_i 1I_{C_j} = Σ_{j=1}^N (Σ_{i: j∈I_i} α_i) 1I_{C_j} = Σ_{j=1}^N γ_j 1I_{C_j},
with γ_j := Σ_{i: j∈I_i} α_i, so that γ_j = 0 whenever C_j ≠ ∅. From this we obtain
Σ_{i=1}^n α_i P(A_i) = Σ_{i=1}^n Σ_{j∈I_i} α_i P(C_j) = Σ_{j=1}^N (Σ_{i: j∈I_i} α_i) P(C_j) = Σ_{j=1}^N γ_j P(C_j) = 0.
Proposition [linearity on step-functions] Let (Ω, F, P) be a probability space and f, g : Ω → ℝ be measurable step-functions. Then
E(f + g) = E f + E g.

Proof. The proof follows immediately from Lemma 3.1.2 and the definition of the expected value of a step-function since, for
f = Σ_{i=1}^n α_i 1I_{A_i}  and  g = Σ_{j=1}^m β_j 1I_{B_j},
one has that
f + g = Σ_{i=1}^n α_i 1I_{A_i} + Σ_{j=1}^m β_j 1I_{B_j}
and
E(f + g) = Σ_{i=1}^n α_i P(A_i) + Σ_{j=1}^m β_j P(B_j) = E f + E g.
In the second step, for a non-negative random variable f approximated by measurable step-functions 0 ≤ f_n(ω) ↑ f(ω), we let
E f = ∫_Ω f dP = ∫_Ω f(ω) dP(ω) := lim_{n→∞} E f_n.
Note that in this definition the case E f = ∞ is allowed. In the last step we define the expectation for a general random variable.

Definition 3.1.5 [step three, f is general] Let (Ω, F, P) be a probability space and f : Ω → ℝ be a random variable. Let
f^+(ω) := max{f(ω), 0}  and  f^−(ω) := max{−f(ω), 0}.
(1) If E f^+ < ∞ or E f^− < ∞, then we say that the expected value of f exists and set
E f := E f^+ − E f^−,
which may take the values ±∞.
(2) The random variable f is called integrable provided that
E f^+ < ∞  and  E f^− < ∞.
(3) If the expected value of f exists and A ∈ F, then
∫_A f dP = ∫_A f(ω) dP(ω) := ∫_Ω f(ω) 1I_A(ω) dP(ω).
A simple example for the expectation is the expected value while rolling a die:

Example 3.1.8 Assume that Ω := {1, 2, ..., 6}, F := 2^Ω, and P({k}) := 1/6, which models rolling a die. If we define f(k) := k, i.e.
f(k) := Σ_{i=1}^6 i 1I_{{i}}(k),
then
E f = Σ_{i=1}^6 i P({i}) = (1 + 2 + ··· + 6)/6 = 3.5.
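Continuing the die example with a short additional computation (added here for illustration), the same recipe gives the second moment and the variance:

E f² = Σ_{i=1}^6 i² P({i}) = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6,
E(f − E f)² = E f² − (E f)² = 91/6 − (3.5)² = 35/12 ≈ 2.92.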
3.2 Basic properties of the expected value

We say that a property P(ω) holds P-almost surely or almost surely (a.s.) if the set
{ω : P(ω) holds}
belongs to F and is of measure one. Let us start with some first properties of the expected value.

Proposition 3.2.1 Assume a probability space (Ω, F, P) and random variables f, g : Ω → ℝ.
(1) If 0 ≤ f(ω) ≤ g(ω) for all ω ∈ Ω, then 0 ≤ E f ≤ E g.
(2) The random variable f is integrable if and only if |f| is integrable. In this case one has |E f| ≤ E |f|.
(3) If f = 0 a.s., then E f = 0.
(4) If f ≥ 0 a.s. and E f = 0, then f = 0 a.s.
(5) If f = g a.s. and E f exists, then E g exists and E f = E g.

Proof. (1) follows directly from the definition. Property (2) can be seen as follows: by definition, the random variable f is integrable if and only if E f^+ < ∞ and E f^− < ∞. Since
{ω : f^+(ω) ≠ 0} ∩ {ω : f^−(ω) ≠ 0} = ∅
and since both sets are measurable, it follows that |f| = f^+ + f^− is integrable if and only if f^+ and f^− are integrable, and that
|E f| = |E f^+ − E f^−| ≤ E f^+ + E f^− = E |f|.
Proposition Let (Ω, F, P) be a probability space and f : Ω → ℝ be a random variable whose expected value exists. Then there are measurable step-functions (f_n)_{n=1}^∞ such that
f(ω) = lim_{n→∞} f_n(ω)  and  E f = lim_{n→∞} E f_n.

Proof. (1) It is easy to verify that the staircase-functions
f_n(ω) := Σ_{k=−4^n}^{4^n − 1} (k/2^n) 1I_{{k/2^n ≤ f < (k+1)/2^n}}(ω)
and, for f ≥ 0,
f_n^0(ω) := Σ_{k=0}^{4^n − 1} (k/2^n) 1I_{{k/2^n ≤ f < (k+1)/2^n}}(ω)
are measurable step-functions converging to f. Moreover, for measurable step-functions g_n and h_n with
g_n ≤ h_n ≤ f
and g_n ↑ f, h_n ↑ f one has
lim_{n→∞} E g_n = lim_{n→∞} E h_n = E f.
Consider
d_{k,n} := f_k ∧ h_n.
Clearly, d_{k,n} ↑ f_k as n → ∞ and d_{k,n} ↑ h_n as k → ∞. Let z_{k,n} := arctan d_{k,n}, which reduces the problem to bounded functions, so that the two limits may be interchanged. Hence
E f = lim_{n→∞} E h_n = lim_{n→∞} lim_{k→∞} E d_{k,n} = lim_{k→∞} lim_{n→∞} E d_{k,n} = lim_{k→∞} E f_k.
Finally, for a measurable step-function one obtains, for every ε ∈ (0, 1),
(1 − ε) P(A) ≤ lim_{n→∞} E φ_n,
and letting ε ↓ 0 gives P(A) ≤ lim_{n→∞} E φ_n, so we are done.
Now we continue with some basic properties of the expectation.

Proposition 3.2.3 [properties of the expectation] Let (Ω, F, P) be a probability space and f, g : Ω → ℝ be random variables such that E f and E g exist.
(1) If E f^+ + E g^+ < ∞ or E f^− + E g^− < ∞, then E(f + g) exists and E(f + g) = E f + E g.
(2) If c ∈ ℝ, then E(cf) exists and E(cf) = c E f.
(3) If f ≤ g, then E f ≤ E g.

Proof. (1) One uses the decomposition
(f + g)^+ + f^− + g^− = f^+ + g^+ + (f + g)^−   (3.1)
and, for non-negative random variables φ and ψ approximated by measurable step-functions with 0 ≤ φ_n(ω) ↑ φ(ω) and 0 ≤ ψ_n(ω) ↑ ψ(ω) for all ω ∈ Ω,
E φ + E ψ = lim_{n→∞} E φ_n + lim_{n→∞} E ψ_n = lim_{n→∞} E(φ_n + ψ_n) = E(φ + ψ).
(2) is an exercise.
For each f_n take a sequence of step-functions (f_{n,k})_{k≥1} such that 0 ≤ f_{n,k} ↑ f_n as k → ∞. Setting
h_N := max_{1≤k≤N, 1≤n≤N} f_{n,k}
we get step-functions with h_N ≤ f_N ≤ f and h_N ↑ f, so that
E f ≤ lim_{N→∞} E f_N ≤ E f.
Hence f_n ↑ f implies E f_n ↑ E f.
Proposition [Lemma of Fatou for random variables] Let (Ω, F, P) be a probability space and g, f_1, f_2, ... : Ω → ℝ be random variables with |f_n| ≤ g a.s. and E g < ∞. Then
E lim inf_{n→∞} f_n ≤ lim inf_{n→∞} E f_n ≤ lim sup_{n→∞} E f_n ≤ E lim sup_{n→∞} f_n.

Proof. We only prove the first inequality. The second one follows from the definition of lim sup and lim inf, the third one can be proved like the first one. So we let
Z_k := inf_{n≥k} f_n,
so that Z_k ↑ lim inf_n f_n and Z_k ≤ f_n for all n ≥ k. Applying monotone convergence gives
E lim inf_n f_n = lim_{k→∞} E Z_k = lim_{k→∞} E (inf_{n≥k} f_n) ≤ lim inf_{k→∞} (inf_{n≥k} E f_n) = lim inf_{n→∞} E f_n.

Proposition [Lebesgue's Theorem, dominated convergence] Let (Ω, F, P) be a probability space and g, f, f_1, f_2, ... : Ω → ℝ be random variables with |f_n| ≤ g a.s., E g < ∞, and f = lim_{n→∞} f_n a.s. Then f is integrable and
E f = lim_{n→∞} E f_n.

Proof. Applying Fatou's Lemma gives
E f = E lim inf_n f_n ≤ lim inf_n E f_n ≤ lim sup_n E f_n ≤ E lim sup_n f_n = E f.
3.3 Connections to the Riemann-integral

Proposition Let f : [0, 1] → ℝ be a continuous function. Then
∫_0^1 f(x) dx = E f
with the Riemann-integral on the left-hand side and the expectation of the random variable f with respect to the probability space ([0, 1], B([0, 1]), λ), where λ is the Lebesgue measure, on the right-hand side.

Now we consider a continuous function p : ℝ → [0, ∞) such that
∫_{−∞}^∞ p(x) dx = 1
and define a probability measure P on B(ℝ) by
P((a_1, b_1] ∪ ··· ∪ (a_n, b_n]) := Σ_{i=1}^n ∫_{a_i}^{b_i} p(x) dx
for disjoint intervals. Let f : ℝ → ℝ be a continuous function with
∫_{−∞}^∞ |f(x)| p(x) dx < ∞.
Then
∫_{−∞}^∞ f(x) p(x) dx = E f
with the Riemann-integral on the left-hand side and the expectation of the random variable f with respect to the probability space (ℝ, B(ℝ), P) on the right-hand side.

Let us consider two examples indicating the difference between the Riemann-integral and our expected value.
49
f (x) :=
f = 1 if
lim
sin x
dx =
x
2
sin x
x
dx = and
0
sin x
x
dx = .
Transporting this into a probabilistic setting we take the exponential distribution with parameter λ > 0 from Section 1.3.6. Let f : ℝ → ℝ be given by f(x) := 0 if x ≤ 0 and f(x) := ((sin x)/x) e^{λx} if x > 0, and recall that the exponential distribution with parameter λ > 0 is given by the density p_λ(x) = 1I_{[0,∞)}(x) λ e^{−λx}. The above yields that
lim_{t→∞} ∫_0^t f(x) p_λ(x) dx = λπ/2,
but
∫_ℝ f(x)^+ dμ_λ(x) = ∫_ℝ f(x)^− dμ_λ(x) = ∞.
Hence the expected value of f does not exist, but the Riemann-integral gives a way to define a value, which makes sense. The point of this example is that the Riemann-integral takes more information into account than the rather abstract expected value.
3.4 Change of variables in the expected value

Proposition [Change of variables] Let (Ω, F, P) be a probability space, (E, E) a measurable space, φ : Ω → E an (F, E)-measurable map, that means φ^{−1}(A) ∈ F for all A ∈ E, and g : E → ℝ a random variable. Then
∫_A g(η) dP_φ(η) = ∫_{φ^{−1}(A)} g(φ(ω)) dP(ω),
where P_φ(A) := P(φ^{−1}(A)) is the image measure, for all A ∈ E in the sense that if one integral exists, the other exists as well, and their values are equal.

Proof. (i) Letting g̃(η) := 1I_A(η) g(η) we have
g̃(φ(ω)) = 1I_{φ^{−1}(A)}(ω) g(φ(ω)),
so that it is sufficient to consider the case A = E. Hence we have to show that
∫_E g(η) dP_φ(η) = ∫_Ω g(φ(ω)) dP(ω).
(ii) For B ∈ E,
∫_E 1I_B(η) dP_φ(η) = P_φ(B) = P(φ^{−1}(B)) = ∫_Ω 1I_B(φ(ω)) dP(ω),
so the equality holds for indicator functions, hence for measurable step-functions,
∫_E g_n(η) dP_φ(η) = ∫_Ω g_n(φ(ω)) dP(ω),
and by approximation it carries over to general g.
Example Assume that the law P_f of a random variable f satisfies
P_f((a, b]) = ∫_a^b p(x) dx
for all −∞ < a < b < ∞, where p : ℝ → [0, ∞) is a continuous function with ∫_ℝ p(x) dx = 1. Then, for n = 1, 2, ...,
E f^n = ∫_Ω f(ω)^n dP(ω) = ∫_ℝ x^n dP_f(x) = ∫_{−∞}^∞ x^n p(x) dx,
provided the integrals exist.

Example Assume that
P_φ = Σ_{k=1}^∞ p_k δ_{η_k}
with p_k ≥ 0, Σ_{k=1}^∞ p_k = 1, and some η_k ∈ E (that means that the image measure of P with respect to φ is discrete). Then
∫_Ω g(φ(ω)) dP(ω) = ∫_E g(η) dP_φ(η) = Σ_{k=1}^∞ p_k g(η_k).
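As a worked instance of the first example (the computation is added here for illustration), take for P_f the exponential distribution with parameter λ > 0 from Section 1.3.6, so that p(x) = 1I_{[0,∞)}(x) λ e^{−λx}. Then

E f = ∫_0^∞ x λ e^{−λx} dx = [−x e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = 1/λ,

using integration by parts.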
3.5 Fubini's Theorem

Proposition 3.5.2 [Monotone Class Theorem] Let H be a class of bounded functions from Ω into ℝ satisfying the following conditions:
(1) H is a vector space.
(2) 1I_Ω ∈ H.
(3) If f_n ∈ H, f_n ≥ 0, and f_n ↑ f, where f is bounded on Ω, then f ∈ H.
Then one has the following: if H contains the indicator function of every set from some π-system I of subsets of Ω, then H contains every bounded σ(I)-measurable function on Ω.

Proof. See for example [5] (Theorem 3.14).
For the following it is convenient to allow that the random variables may take infinite values.

Definition 3.5.3 [extended random variable] Let (Ω, F) be a measurable space. A function f : Ω → ℝ ∪ {−∞, ∞} is called extended random variable if
f^{−1}(B) := {ω : f(ω) ∈ B} ∈ F  for all B ∈ B(ℝ).

For a non-negative extended random variable f the expected value is defined by
∫_Ω f dP := lim_{N→∞} ∫_Ω [f ∧ N] dP.
If f : Ω_1 × Ω_2 → ℝ is a bounded F_1 ⊗ F_2-measurable function, then the maps
ω_1 → ∫_{Ω_2} f(ω_1, ω_2) dP_2(ω_2)  and  ω_2 → ∫_{Ω_1} f(ω_1, ω_2) dP_1(ω_1)
are random variables on (Ω_1, F_1) and (Ω_2, F_2), respectively, and
∫_{Ω_1×Ω_2} f(ω_1, ω_2) d(P_1 × P_2) = ∫_{Ω_1} [∫_{Ω_2} f(ω_1, ω_2) dP_2(ω_2)] dP_1(ω_1) = ∫_{Ω_2} [∫_{Ω_1} f(ω_1, ω_2) dP_1(ω_1)] dP_2(ω_2).

It should be noted that item (3) of Fubini's Theorem below, together with Formula (3.2), automatically implies that
P_2({ω_2 : ∫_{Ω_1} |f(ω_1, ω_2)| dP_1(ω_1) = ∞}) = 0  and  P_1({ω_1 : ∫_{Ω_2} |f(ω_1, ω_2)| dP_2(ω_2) = ∞}) = 0.
Again, using Propositions 2.1.4 and 3.2.4 we see that H satisfies the assumptions (1), (2), and (3) of Proposition 3.5.2. As π-system I we take the system of all F = A × B with A ∈ F_1 and B ∈ F_2. Letting f(ω_1, ω_2) = 1I_A(ω_1) 1I_B(ω_2) we easily can check that f ∈ H. For instance, property (c) follows from
∫_{Ω_1} [∫_{Ω_2} f(ω_1, ω_2) dP_2(ω_2)] dP_1(ω_1) = ∫_{Ω_1} 1I_A(ω_1) P_2(B) dP_1(ω_1) = P_1(A) P_2(B).
Applying the Monotone Class Theorem, Proposition 3.5.2, gives that H consists of all bounded functions f : Ω_1 × Ω_2 → ℝ measurable with respect to F_1 ⊗ F_2. Hence we are done.
Now we state Fubini's Theorem for general random variables f : Ω_1 × Ω_2 → ℝ.

Proposition 3.5.5 [Fubini's Theorem] Let f : Ω_1 × Ω_2 → ℝ be an F_1 ⊗ F_2-measurable function such that
∫_{Ω_1×Ω_2} |f(ω_1, ω_2)| d(P_1 × P_2)(ω_1, ω_2) < ∞.   (3.3)
Then the following holds:
(1) For fixed ω_2^0 ∈ Ω_2 and ω_1^0 ∈ Ω_1 the integrals
∫_{Ω_1} f(ω_1, ω_2^0) dP_1(ω_1)  and  ∫_{Ω_2} f(ω_1^0, ω_2) dP_2(ω_2)
exist for P_2-almost all ω_2^0 and P_1-almost all ω_1^0.
(2) There are sets M_1 ∈ F_1 and M_2 ∈ F_2 with P_1(M_1) = P_2(M_2) = 1 such that the maps
ω_1 → 1I_{M_1}(ω_1) ∫_{Ω_2} f(ω_1, ω_2) dP_2(ω_2)  and  ω_2 → 1I_{M_2}(ω_2) ∫_{Ω_1} f(ω_1, ω_2) dP_1(ω_1)
are random variables.
(3) One has that
∫_{Ω_1×Ω_2} f(ω_1, ω_2) d(P_1 × P_2) = ∫_{Ω_1} 1I_{M_1}(ω_1) [∫_{Ω_2} f(ω_1, ω_2) dP_2(ω_2)] dP_1(ω_1) = ∫_{Ω_2} 1I_{M_2}(ω_2) [∫_{Ω_1} f(ω_1, ω_2) dP_1(ω_1)] dP_2(ω_2).
Example As an application we compute
∫_{−∞}^∞ e^{−x²} dx
by Fubini's Theorem. For N > 0 apply Fubini's Theorem to the bounded function f(x, y) := e^{−(x²+y²)} = e^{−x²} e^{−y²} on [−N, N] × [−N, N] (normalizing the Lebesgue measure on each factor by 1/(2N) to obtain probability spaces and multiplying by (2N)² afterwards). This gives
∫_{−N}^N [∫_{−N}^N e^{−x²} e^{−y²} dλ(y)] dλ(x) = ∫_{[−N,N]×[−N,N]} e^{−(x²+y²)} d(λ × λ)(x, y).
Letting N → ∞, the left-hand side converges to
(∫_{−∞}^∞ e^{−x²} dλ(x))²,
while for the right-hand side, using polar coordinates,
lim_{N→∞} ∫_{[−N,N]×[−N,N]} e^{−(x²+y²)} d(λ × λ)(x, y) = lim_{R→∞} ∫_{x²+y²≤R²} e^{−(x²+y²)} d(λ × λ)(x, y)
  = lim_{R→∞} ∫_0^{2π} ∫_0^R e^{−r²} r dr dφ = lim_{R→∞} π (1 − e^{−R²}) = π.
Hence
∫_{−∞}^∞ e^{−x²} dλ(x) = √π.
Example Let f be a random variable with law N_{m,σ²}, that means with density
p_{m,σ²}(x) = (1/√(2πσ²)) e^{−(x−m)²/(2σ²)}.
Then
E f = ∫_{−∞}^∞ x p_{m,σ²}(x) dx = m   (3.4)
and
E(f − E f)² = ∫_{−∞}^∞ (x − m)² p_{m,σ²}(x) dx = σ².   (3.5)
Indeed, by the substitution z = (x − m)/σ it is enough to consider m = 0 and σ² = 1. The computation of ∫ e^{−x²} dx above gives, with x = z/√2,
(1/√π) ∫_{−∞}^∞ e^{−x²} dx = (1/√(2π)) ∫_{−∞}^∞ e^{−z²/2} dz = 1;
the relation
∫_{−∞}^∞ x p_{0,1}(x) dx = 0
follows from the symmetry of the density, p_{0,1}(x) = p_{0,1}(−x); and finally, by partial integration (use (x exp(−x²/2))' = exp(−x²/2) − x² exp(−x²/2)) one can also compute that
(1/√(2π)) ∫_{−∞}^∞ x² e^{−x²/2} dx = (1/√(2π)) ∫_{−∞}^∞ e^{−x²/2} dx = 1.
Example The function
f(x, y) := xy/(x² + y²)²
for (x, y) ≠ (0, 0) and f(0, 0) := 0 is not integrable on [−1, 1] × [−1, 1], even though the iterated integrals exist and are equal. In fact,
∫_{−1}^1 f(x, y) dλ(y) = 0,
so that both iterated integrals vanish, but, using polar coordinates,
∫_{[−1,1]×[−1,1]} |f(x, y)| d(λ × λ)(x, y) ≥ ∫_0^1 ∫_0^{2π} (|sin φ cos φ|/r) dφ dr = 2 ∫_0^1 (1/r) dr = ∞.
The inequality holds because on the right-hand side we integrate only over the area {(x, y) : x² + y² ≤ 1}, which is a subset of [−1, 1] × [−1, 1], and
∫_0^{2π} |sin φ cos φ| dφ = 4 ∫_0^{π/2} sin φ cos φ dφ = 2.
3.6 Some inequalities

Proposition [Jensen's inequality] Let (Ω, F, P) be a probability space, f : Ω → ℝ an integrable random variable, and g : ℝ → ℝ a convex function. Then
g(E f) ≤ E g(f),
where the expected value on the right-hand side might be infinity.

Example 3.6.4 (1) The function g(x) := |x| is convex so that, for any integrable f,
|E f| ≤ E |f|.
(2) For 1 ≤ p < ∞ the function g(x) := |x|^p is convex, so that Jensen's inequality applied to |f| gives that
(E |f|)^p ≤ E |f|^p.

For the second case in the example above there is another way we can go. It uses the famous Hölder-inequality.
Proposition 3.6.5 [Hölder's inequality] Assume a probability space (Ω, F, P) and random variables f, g : Ω → ℝ. If 1 < p, q < ∞ with 1/p + 1/q = 1, then
E |fg| ≤ (E |f|^p)^{1/p} (E |g|^q)^{1/q}.

Proof. We can assume that E |f|^p > 0 and E |g|^q > 0. For example, assuming E |f|^p = 0 would imply |f|^p = 0 a.s. according to Proposition 3.2.1, so that fg = 0 a.s. and E |fg| = 0. Hence we may set
h := f / (E |f|^p)^{1/p}  and  k := g / (E |g|^q)^{1/q}.
We notice that
x^a y^b ≤ ax + by
for x, y ≥ 0 and positive a, b with a + b = 1, which follows from the concavity of the logarithm (we can assume for a moment that x, y > 0):
ln(ax + by) ≥ a ln x + b ln y = ln x^a + ln y^b = ln(x^a y^b).
Setting x := |h|^p, y := |k|^q, a := 1/p, and b := 1/q, we get
|hk| = x^a y^b ≤ ax + by = (1/p)|h|^p + (1/q)|k|^q
and therefore
E |hk| ≤ (1/p) E |h|^p + (1/q) E |k|^q = 1/p + 1/q = 1.
Since
E |hk| = E |fg| / [(E |f|^p)^{1/p} (E |g|^q)^{1/q}],
we are done.
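The special case p = q = 2 is worth recording separately (a standard consequence, noted here because it is used again in Section 4.2):

E |fg| ≤ (E f²)^{1/2} (E g²)^{1/2}   (Cauchy-Schwarz inequality).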
Another consequence of Hölder's inequality is its version for sequences: for (a_n)_{n=1}^∞, (b_n)_{n=1}^∞ ⊆ ℝ one has
Σ_{n=1}^∞ |a_n b_n| ≤ (Σ_{n=1}^∞ |a_n|^p)^{1/p} (Σ_{n=1}^∞ |b_n|^q)^{1/q},
which follows from the finite version
Σ_{n=1}^N |a_n b_n| ≤ (Σ_{n=1}^N |a_n|^p)^{1/p} (Σ_{n=1}^N |b_n|^q)^{1/q}   (3.6)
by letting N → ∞.
Proposition [Minkowski's inequality] Assume a probability space (Ω, F, P), random variables f, g : Ω → ℝ, and 1 ≤ p < ∞. Then
(E |f + g|^p)^{1/p} ≤ (E |f|^p)^{1/p} + (E |g|^p)^{1/p}.

Proof. For p = 1 the assertion follows from |f + g| ≤ |f| + |g|. So let 1 < p < ∞ and q := p/(p − 1). The convexity of x → |x|^p gives
((a + b)/2)^p ≤ (|a|^p + |b|^p)/2
and hence (a + b)^p ≤ 2^{p−1}(a^p + b^p) for a, b ≥ 0. Consequently,
|f + g|^p ≤ (|f| + |g|)^p ≤ 2^{p−1}(|f|^p + |g|^p),
so that E |f + g|^p < ∞ whenever E |f|^p, E |g|^p < ∞. Moreover, by Hölder's inequality,
E |f + g|^p = E(|f + g| |f + g|^{p−1}) ≤ E(|f| |f + g|^{p−1}) + E(|g| |f + g|^{p−1})
            ≤ (E |f|^p)^{1/p} (E |f + g|^{(p−1)q})^{1/q} + (E |g|^p)^{1/p} (E |f + g|^{(p−1)q})^{1/q}.
Since (p − 1)q = p, dividing by (E |f + g|^p)^{1/q} (the case E |f + g|^p = 0 being trivial) gives the assertion.
Proposition [Chebyshev's inequality] Let f be a random variable such that E f² exists. Then, for all ε > 0,
P(|f − E f| ≥ ε) ≤ E(f − E f)²/ε² ≤ E f²/ε²,
where the second inequality follows from
E(f − E f)² = E f² − (E f)² ≤ E f².
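A one-line worked consequence (added for illustration): taking ε = 2σ, where σ² := E(f − E f)², gives

P(|f − E f| ≥ 2σ) ≤ σ²/(2σ)² = 1/4,

so any random variable with finite variance deviates from its mean by more than two standard deviations with probability at most 25%.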
Chapter 4
Modes of convergence

4.1 Definitions

Definition 4.1.1 Let (Ω, F, P) be a probability space and f, f_1, f_2, ... : Ω → ℝ be random variables.
(1) The sequence (f_n)_{n=1}^∞ converges almost surely (a.s.) to f if and only if
P({ω : f_n(ω) → f(ω) as n → ∞}) = 1.
(2) The sequence (f_n)_{n=1}^∞ converges in probability to f if and only if, for all ε > 0,
P({ω : |f_n(ω) − f(ω)| > ε}) → 0  as n → ∞.
(3) If 0 < p < ∞, then (f_n)_{n=1}^∞ converges to f in L^p if and only if
E |f_n − f|^p → 0  as n → ∞.
For the above types of convergence the random variables have to be defined on the same probability space. There is a variant without this assumption.

Definition 4.1.2 [Convergence in distribution] Let (Ω_n, F_n, P_n) and (Ω, F, P) be probability spaces and let f_n : Ω_n → ℝ and f : Ω → ℝ be random variables. Then the sequence (f_n)_{n=1}^∞ converges in distribution to f (f_n →^d f) if and only if
E ψ(f_n) → E ψ(f)  as n → ∞
for all bounded continuous functions ψ : ℝ → ℝ.
(3) If f_n → f in L_p, then f_n → f in probability.
(4) One has that f_n →^d f if and only if F_{f_n}(x) → F_f(x) at each point x of continuity of F_f, where F_{f_n} and F_f are the distribution-functions of f_n and f, respectively.
Example On ([0, 1], B([0, 1]), λ) consider indicator functions f_n := 1I_{A_n}, where the A_n run through intervals of length
ℓ_n = 1/2 if n = 1, 2,  ℓ_n = 1/4 if n = 3, 4, ..., 6,  ℓ_n = 1/8 if n = 7, ...,
and so on, covering [0, 1] again and again. Then P(|f_n| > ε) = ℓ_n → 0, so f_n → 0 in probability, but for every ω the sequence (f_n(ω)) takes the values 0 and 1 infinitely often, so there is no almost sure convergence.
4.2 Some applications

Proposition [Weak law of large numbers] Let (f_n)_{n=1}^∞ be a sequence of independent random variables with
E f_k = m  and  E(f_k − m)² = σ²  for all k = 1, 2, ...
Then
(f_1 + ··· + f_n)/n → m in probability as n → ∞,
that means, for each ε > 0,
lim_{n→∞} P({ω : |(f_1(ω) + ··· + f_n(ω))/n − m| > ε}) = 0.

Proof. By Chebyshev's inequality we have that
P({ω : |f_1 + ··· + f_n − nm| > nε}) ≤ E |f_1 + ··· + f_n − nm|² / (n²ε²) = E(Σ_{k=1}^n (f_k − m))² / (n²ε²) = nσ² / (n²ε²) → 0
as n → ∞.

Using a stronger condition, we get easily more: the almost sure convergence instead of the convergence in probability.
Proposition 4.2.2 [Strong law of large numbers] Let (f_n)_{n=1}^∞ be a sequence of independent random variables with E f_k = 0, k = 1, 2, ..., and c := sup_n E f_n^4 < ∞. Then
(f_1 + ··· + f_n)/n → 0 a.s.
Proof. Let S_n := Σ_{k=1}^n f_k. It holds
E S_n^4 = E (Σ_{k=1}^n f_k)^4 = Σ_{i,j,k,l=1}^n E f_i f_j f_k f_l = Σ_{k=1}^n E f_k^4 + 3 Σ_{k,l=1, k≠l}^n E f_k² f_l²,
because for independent mean-zero random variables all mixed terms containing a factor to the first power vanish. By the Cauchy-Schwarz inequality,
E f_k² f_l² ≤ (E f_k^4)^{1/2} (E f_l^4)^{1/2} ≤ c.
Hence
E (S_n/n)^4 = E S_n^4 / n^4 ≤ (nc + 3n(n − 1)c)/n^4 ≤ 3c/n²,
and
Σ_{n=1}^∞ E (S_n/n)^4 = E Σ_{n=1}^∞ (S_n/n)^4 ≤ Σ_{n=1}^∞ 3c/n² < ∞.
Consequently, Σ_{n=1}^∞ (S_n(ω)/n)^4 < ∞ for almost all ω, so that
S_n/n → 0 a.s.
There are several strong laws of large numbers with other, in particular weaker, conditions. Another set of results related to almost sure convergence comes from Kolmogorov's 0-1-law. For example, we know that Σ_{n=1}^∞ 1/n = ∞ but that Σ_{n=1}^∞ (−1)^n/n converges. What happens if we choose the signs +, − randomly, for example using independent random variables ε_n, n = 1, 2, ..., with
P(ε_n = 1) = P(ε_n = −1) = 1/2?
In other words, what is the probability of the event
A := {ω : Σ_{n=1}^∞ ε_n(ω)/n converges}?   (4.1)
Proposition [Kolmogorov's 0-1-law] Let (f_n)_{n=1}^∞ be a sequence of independent random variables and let
F_n := σ(f_n, f_{n+1}, ...).
Define the tail σ-algebra
T := ⋂_{n=1}^∞ F_n.
Then P(A) ∈ {0, 1} for all A ∈ T.

Proof. See [5].
Example 4.2.5 Let us come back to the set A considered in Formula (4.1). For all n ∈ {1, 2, ...} we have
A = {ω : Σ_{k=n}^∞ ε_k(ω)/k converges} ∈ F_n,
so that A ∈ T.
We close with a fundamental example concerning the convergence in distribution: the Central Limit Theorem (CLT). For this we need

Definition 4.2.6 Let (Ω, F, P) be a probability space. A sequence of Independent random variables f_n : Ω → ℝ is called Identically Distributed (i.i.d.) provided that the random variables f_n have the same law, that means
P(f_n ≤ λ) = P(f_k ≤ λ)
for all n, k = 1, 2, ... and all λ ∈ ℝ.
Let (Ω, F, P) be a probability space and (f_n)_{n=1}^∞ be a sequence of i.i.d. random variables with E f_1 = 0 and E f_1² = σ². By the law of large numbers we know
(f_1 + ··· + f_n)/n → 0 in probability.
Hence the law of the limit is the Dirac-measure δ_0. Is there a right scaling factor c(n) such that
(f_1 + ··· + f_n)/c(n) → g
for some non-degenerate random variable g? The Central Limit Theorem answers this with the scaling c(n) := σ√n:
P((f_1 + ··· + f_n)/(σ√n) ≤ x) → P(g ≤ x)  as n → ∞, for all x ∈ ℝ,
for any g with
P(g ≤ x) = (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du,
that means g has the standard normal distribution N_{0,1}.
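A simulation sketch of the Central Limit Theorem (numpy, the choice of random signs ε_k = ±1 with σ² = 1, and the sample sizes are assumptions made for this illustration): standardized sums have an empirical distribution function close to the standard normal one.

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(1)
    n, repetitions = 1000, 50_000
    signs = rng.choice([-1.0, 1.0], size=(repetitions, n))   # E f_1 = 0, variance 1
    z = signs.sum(axis=1) / sqrt(n)                          # (f_1 + ... + f_n) / (sigma sqrt(n))

    def phi(x: float) -> float:
        # standard normal distribution function
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    for x in (-1.0, 0.0, 1.0, 1.96):
        print(x, (z <= x).mean(), phi(x))    # empirical value vs. limiting value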
Index

λ-system, 25
lim inf_n A_n, 15
lim inf_n ξ_n, 15
lim sup_n A_n, 15
lim sup_n ξ_n, 15
π-system, 20
π-systems and uniqueness of measures, 20
π-λ-Theorem, 25
σ-algebra, 8
σ-finite, 12
algebra, 8
axiom of choice, 26
Bayes' formula, 17
binomial distribution, 20
Borel σ-algebra, 11
Borel σ-algebra on ℝ^n, 20
Carathéodory's extension theorem, 19
central limit theorem, 67
change of variables, 49
Chebyshev's inequality, 58
closed set, 11
conditional probability, 16
convergence almost surely, 63
convergence in L^p, 63
convergence in distribution, 63
convergence in probability, 63
convexity, 58
counting measure, 13
Dirac measure, 12
distribution-function, 34
dominated convergence, 47
equivalence relation, 25
Hölder's inequality, 59
i.i.d. sequence, 67
independence of a family of events, 36
independence of a family of random variables, 35
independence of a finite family of random variables, 36
independence of a sequence of events, 15
Jensen's inequality, 58
Kolmogorov's 0-1-law, 66
law of a random variable, 33
Lebesgue integrable, 41
Lebesgue measure, 21, 22
Lebesgue's Theorem, 47
lemma of Borel-Cantelli, 17
lemma of Fatou, 15
lemma of Fatou for random variables, 47
measurable map, 32
measurable space, 8
measurable step-function, 29
measure, 12
measure space, 12
Minkowski's inequality, 60
monotone class theorem, 52
monotone convergence, 45
open set, 11
Poisson distribution, 21
Poisson's Theorem, 24
probability measure, 12
probability space, 12
product of probability spaces, 19
random variable, 30
Realization of independent random variables, 38
step-function, 29
strong law of large numbers, 65
tail σ-algebra, 66
uniform distribution, 21
variance, 41
vector space, 51
weak law of large numbers, 64
Bibliography
[1] H. Bauer. Probability theory. Walter de Gruyter, 1996.
[2] H. Bauer. Measure and integration theory. Walter de Gruyter, 2001.
[3] P. Billingsley. Probability and Measure. Wiley, 1995.
[4] A.N. Shiryaev. Probability. Springer, 1996.
[5] D. Williams. Probability with martingales. Cambridge University Press,
1991.