EE230 Lectures
These lecture notes were prepared with the purpose of helping the students to follow the lectures
more easily and efficiently. This course is a fast-paced course (like many courses in the depart-
ment) with a significant amount of material, and to cover all of this material at a reasonable pace
in the lectures, we intend to benefit from these partially-complete lecture notes. In particular,
we included important results, properties, comments and examples, but left out most of the
mathematics, derivations and solutions of examples, which we do on the board and expect the
students to write into the provided empty spaces in the notes. We hope that this approach will
reduce the note-taking burden on the students and will enable more time to stress important
concepts and discuss more examples.
These lecture notes were prepared mainly from our textbook titled "Introduction to Probability" by Dimitri P. Bertsekas and John N. Tsitsiklis, by revising the notes prepared earlier by Elif Uysal-Biyikoglu and A. Ozgur Yilmaz. Notes of A. Aydin Alatan and discussions with fellow lecturers Elif Vural, S. Figen Oktem and Emre Ozkan were also helpful in preparing these notes.
This is the first version of the notes. Therefore the notes may contain errors and we also believe
there is room for improving the notes in many aspects. In this regard, we are open to feedback
and comments, especially from the students taking the course.
Fatih Kamisli
May 19th , 2017.
Contents
2.4 Functions of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.1 How to obtain pY (y) given Y = g(X) and pX (x) ? . . . . . . . . . . . . . . . . . 34
2.5 Expectation, Mean, and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.1 Variance, Moments, and the Expected Value Rule . . . . . . . . . . . . . . . . . 35
2.5.2 Properties of Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5.3 Mean and Variance of Some Common R.V.s . . . . . . . . . . . . . . . . . . . . . 37
2.6 Multiple Random Variables and their Joint PMFs . . . . . . . . . . . . . . . . . . . . . 38
2.6.1 Functions of Multiple Random Variables . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.2 More Than Two Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.7.1 Conditioning a random variable on an event . . . . . . . . . . . . . . . . . . . . . 42
2.7.2 Conditioning one random variable on another . . . . . . . . . . . . . . . . . . . . 43
2.7.3 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.8 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.8.1 Independence of a Random Variable from an Event . . . . . . . . . . . . . . . . . 49
2.8.2 Independence of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.8.3 Variance of the Sum of Independent Random Variables . . . . . . . . . . . . . . 51
4.2 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Conditional Expectations Revisited (Iterated Expectations) . . . . . . . . . . . . . . . . 90
4.4 Transforms (Moment Generating Functions) . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.1 From Transforms to Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.2 Inversion of Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.3 Sums of Independent R.V.s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.4 Transforms Associated with Joint Distributions . . . . . . . . . . . . . . . . . . . 94
4.5 Sum of a Random Number of Independent R.V.s . . . . . . . . . . . . . . . . . . . . . . 94
5 Limit Theorems 97
5.1 Markov and Chebychev Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 The Weak Law of Large Numbers (WLLN) . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3 Convergence in Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4 The Central Limit Theorem (CLT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Chapter 1
Contents
1.1 Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Set Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.2 Properties of Sets and Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Random Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Sample Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 Probability Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.4 Properties of Probability Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.5 Discrete Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.6 Continuous Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.1 Properties of Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.2 Chain (Multiplication) Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 Total Probability Theorem and Bayes’ Rule . . . . . . . . . . . . . . . . . . 17
1.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5.1 Properties of Independent Events . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5.2 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5.3 Independence of a Collection of Events . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5.4 Independent Trials and Binomial Probabilities . . . . . . . . . . . . . . . . . . . 24
1.6 Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Element of a set (and not element of a set)
The universal set (Ω): The set which contains all the sets under investigation
Some relations
1. Union
2. Intersection
3. Complement of a set
4. Difference
• Disjoint sets
– Two sets A and B are called disjoint (or mutually exclusive) if A ∩ B = ∅.
– More generally, multiple sets are called disjoint if no two of them have a common
element.
• Partition of a set
– A collection of sets is said to be a partition of a set S if the sets in the collection
are disjoint and their union is S.
• Associative: A ∪ (B ∪ C) = (A ∪ B) ∪ C, A ∩ (B ∩ C) = (A ∩ B) ∩ C
• Distributive: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
• A ∩ ∅ =        A ∪ ∅ =        ∅c =        Ωc =
• A ∪ Ac =        A ∩ Ac =        A ∪ Ω =        A ∩ Ω =
• De Morgan’s laws
– (A ∪ B)c =
– (A ∩ B)c =
• The cartesian product of two sets A and B is the set of all ordered pairs such that
A × B = {(a, b)|a ∈ A, b ∈ B}.
Ex:
• Size (cardinality) of a set : The number of elements in the set, shown for a set A as |A|.
Ex:
1.2 Probabilistic Models
A probabilistic model has three basic elements:
1. Random experiment
2. Sample space
3. Probability law.
Definition 3 Every probabilistic model involves an underlying process called the random experiment,
that will produce exactly one out of several possible outcomes.
Ex:
Definition 4 The set of all possible results (or outcomes) of a random experiment is called the
sample space (Ω) of the experiment.
Ex:
Definition 5 A subset of the sample space Ω (i.e. a collection of possible outcomes) is called an
event.
Ex: In the above example where we toss a coin 3 times, define event A as the event that exactly
two H’s occur.
Some definitions
• The sample space should have enough detail to distinguish between all outcomes of interest
while avoiding irrelevant details for the given problem.
Ex: Consider two games, both involving 5 successive coin tosses.
Game 1 : We get $1 for each H.
Game 2 : We get $1 for each toss up to the first H. Then we receive $2 for each toss up to the second H, and so on. In particular, the received amount per toss doubles each time an H comes up.
We are interested in the amount of money we win in both games. In Game 1, only the total number of H's is important, while in Game 2 knowing the tosses on which H occurs is also important.
=⇒
Sequential models
Tree-based sequential descriptions are very useful for such experiments to describe the experi-
ment and the associated sample space.
Ex: Two rolls of a 4 sided die
Definition 6 The probability law assigns to every event A a nonnegative number P(A) called
the probability of event A.
• Intuitively, the probability law specifies the ”likelihood” of any outcome or any event.
• P(A) is often intended as a model for the frequency with which the experiment produces a
value in A when repeated many times independently.
Ex: We have a biased coin with P("H") = 0.6. Throw the coin 100 times. How many "Heads" do you expect to observe?
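A minimal simulation sketch of this frequency interpretation (our addition, not part of the original notes; the seed and variable names are arbitrary):

import random

random.seed(0)
n, p = 100, 0.6
heads = sum(1 for _ in range(n) if random.random() < p)   # each toss comes up "H" with probability 0.6
print(heads, heads / n)   # typically around 60 heads, i.e. a relative frequency close to P("H") = 0.6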
Probability Axioms
1. (Nonnegativity) P(A) ≥ 0 for every event A.
2. (Additivity) If A and B are disjoint events, then P(A ∪ B) = P(A) + P(B). More generally, if the sample space has an infinite number of elements and A1 , A2 , . . . is a sequence of disjoint events, then P(A1 ∪ A2 ∪ . . .) = P(A1 ) + P(A2 ) + . . .
3. (Normalization) P(Ω) = 1
To visualize the probability law, consider a mass of 1, which is spread over the sample space.
Then, P(A) is the total mass that was assigned to the elements of A.
1.2.4 Properties of Probability Laws
(a) P(∅) = 0
(f) P(A ∪ B ∪ C) = P(A) + P(Ac ∩ B) + P(Ac ∩ B c ∩ C)
1.2.5 Discrete Probabilistic Models
• The probability of any event A = {s1 , s2 , . . . , sk } is the sum of the probabilities of its elements.
• Discrete uniform probability law: The sample space consists of |Ω| = n possible outcomes which are equally likely. Then the probability of any event A = {s1 , s2 , . . . , sk } can be obtained by counting the elements in A and Ω:
P(A) = |A| / |Ω|.
Ex: Roll a pair of 4-sided dice. Let us assume the dice are fair and therefore each possible outcome is equally likely.
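As an illustration of the counting behind the discrete uniform law (a sketch of ours; the event below is chosen arbitrarily), one can enumerate the 16 equally likely outcomes and apply P(A) = |A|/|Ω|:

from itertools import product

omega = list(product(range(1, 5), range(1, 5)))        # 16 equally likely outcomes (i, j)
A = [(i, j) for (i, j) in omega if max(i, j) == 3]     # example event: the maximum of the two rolls is 3
print(len(A), len(omega), len(A) / len(omega))         # 5 16 0.3125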
Ex: A box contains 100 light bulbs. The probability that there exists at least one defective bulb
is 0.1. The probability that there exist at least two defective bulbs is 0.05. Let the number of
defective bulbs be the outcome of the random experiment.
(b) Find P(no defective bulbs).
1.2.6 Continuous Probabilistic Models
Ex: Romeo and Juliet arrive equiprobably between 0 and 1 hour. The first one to arrive waits for 1/4 hour and leaves. What is the probability that they meet?
1.3 Conditional Probability
Conditional probabilities reflect our beliefs about events based on partial information about the
outcome of the random experiment.
Ex: There are 200 light bulbs in a box, where each bulb is either a 50, 100 or 150 Watt bulb,
and each bulb is either defective or good. (A table indicating the number of each type of bulb is
given.) Let’s choose a bulb at random from the box.
Definition 7 Let A and B be two events with P(A) ≠ 0. The conditional probability P(B|A) of B given A is defined as
P(B|A) = P(A ∩ B) / P(A).
Theorem 1 For a fixed event A with P(A) > 0, the conditional probabilities P(B|A) form a
legitimate probability law satisfying all three axioms.
Proof 1
• If A and B are disjoint, P(B|A) = .
• If A ⊂ B, P(B|A) = .
• Since conditional probabilities form a legitimate probability law, general properties of prob-
ability law remain valid. For example
Ex: A fair coin is tossed three successive times. Two events are defined as A = {more H than
T come up}, B = {1st toss is a H}. Determine the sample space and find P(A|B).
Ex: An urn has balls numbered with 1 through 5. A ball is drawn and not put back. A second
ball is drawn afterwards. What is the probability that the second ball drawn has an odd number
given that
Ex: Two cards are drawn from a deck in succession without replacement. What is the probability that the first card is a spade and the second card is a club?
Ex: Radar detection problem. If an aircraft is present in a certain area, a radar detects it and generates an alarm signal with probability 0.99. If an aircraft is not present, the radar generates a (false) alarm with probability 0.10. We assume that an aircraft is not present with probability 0.95.
1.3.2 Chain (Multiplication) Rule
Assuming all of the conditioning events have positive probability, the following expression holds:
P(A1 ∩ A2 ∩ . . . ∩ An ) = P(A1 ) P(A2 |A1 ) P(A3 |A1 ∩ A2 ) . . . P(An |A1 ∩ A2 ∩ . . . ∩ An−1 ).
Ex: There are two balls in an urn numbered with 0 and 1. A ball is drawn. If it is 0, the ball is
simply put back. If it is 1, another ball numbered with 1 is put in the urn along with the drawn
ball. The same operation is performed once more. A ball is drawn in the third time. What is
the probability that the drawn balls were all numbered with 1?
Ex: The Monty Hall problem. Look at Example 1.12 (pg.27) from textbook or look at this video
on youtube.
1.4 Total Probability Theorem and Bayes’ Rule
Theorem 2 (Total Probability) Let A1 , . . . , An be disjoint events that form a partition of the sample space and assume that P(Ai ) > 0 for all i. Then, for any event B, we have
Proof 2
• Special case, n = 2: P(B) = P(B|A)P(A)+
Ex: A box contains 250 transistors, all equal in size, shape, etc. 100 of them are manufactured by A, 100 by B, and the rest by C. The transistors from A, B, C are defective with rates 5%, 10%, and 25%, respectively.
Theorem 3 (Bayes’ Rule) Let A1 , . . . , An be disjoint events that form a partition of the sample
space and assume that P(Ai ) > 0 for all i. Then, for any event B with P(B) > 0, we have
P(Ai |B) = P(B|Ai )P(Ai ) / P(B)
         = P(B|Ai )P(Ai ) / [ P(B|A1 )P(A1 ) + P(B|A2 )P(A2 ) + . . . + P(B|An )P(An ) ].
Proof 3
Ex: A coin is tossed and a ball is drawn. If the coin is heads, the ball is drawn from Box H with
3 red and 2 black balls. If the coin is tails, Box T is used which has 2 red and 8 black balls.
(b) P(black)=?
(c) If a red ball is drawn, what is the probability that a heads was thrown?
Ex: Radar detection problem revisited. A = {an aircraft is present}, B = {the radar generates an alarm}. P(A) = 0.05, P(B|A) = 0.99, P(B|Ac ) = 0.1. What is the probability that an aircraft is actually present if the radar generates an alarm?
Ex: A test for a certain disease is assumed to be correct 95% of the time: if a person has the
disease, the test results are positive with probability 0.95, and if the person is not sick, the test
results are negative with probability 0.95. A person randomly drawn from the population has
the disease with probability 0.01. Given that a person is tested positive, what is the probability
that the person is actually sick?
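A quick numerical check of Bayes' rule with the numbers of this example (a sketch of ours; the symbolic solution is worked out in lecture):

p_disease = 0.01
p_pos_given_disease = 0.95       # test correct when the person is sick
p_pos_given_healthy = 0.05       # false positive rate (test correct 95% of the time)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)   # total probability
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos                     # Bayes' rule
print(p_disease_given_pos)       # about 0.16: a positive test is far from conclusive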
Let A1 , . . . , An be disjoint events that form a partition of the sample space and the conditional
probabilities used below be defined. Then
P(B|C) = Σ_{i=1}^{n} P(B|C ∩ Ai ) P(Ai |C).
1.5 Independence
Conditional probability P(A|B) captures partial information that event B provides about event
A.
• In some cases, the occurrence of event B may provide no such information and not change
the probability of A, i.e. P(A|B) = P(A).
Definition 8 Two events A and B are independent if and only if P(A ∩ B) = P(A)P(B).
Ex: Toss 2 fair coins. Event A: first toss is H. Event B: second toss is T. Are A, B independent?
Notes :
• Independence is often easy to grasp intuitively. If two events A, B are generated by distinct and noninteracting physical processes, then the events are independent.
Ex: Toss a fair coin and roll a fair 6-sided die. Find the probability that the coin toss comes up Heads and the die roll results in 5.
Ex: An urn has 4 balls where two of them are red and the others green. Each color has a ball
numbered with 1, and the other is numbered with 2. A ball is randomly drawn. A: event that
the ball drawn is red. B: event that the ball drawn has an odd number on it.
(b) Add another red ball with 3 on it. Are A and B independent now?
1.5.1 Properties of Independent Events
• If A and B are disjoint with P(A) > 0 and P(B) > 0, they are always dependent.
1.5.2 Conditional Independence
Recall that conditional probabilities form a legitimate probability law. We can then talk about various events with respect to this probability law.
Definition 9 Given an event C, events A and B are conditionally independent if and only if
Conditioning may affect/change independence, as shown in the following example.
Ex: Consider two independent tosses of a fair coin. A = First toss is H, B = Second toss is
H, C = First and second toss have the same outcome. Are A and B independent? Are they
independent given C?
Ex: We have two unfair coins, A and B with P(H|coinA) = 0.9, P(H|coinB) = 0.1. We choose
either coin with equal probability and perform independent tosses with the chosen coin.
(b) If we don’t know which coin is chosen, are future tosses independent? Compare P(2nd toss is a T)
and P(2nd toss is a T|first toss is T).
1.5.3 Independence of a Collection of Events
Information on some of the events tells us nothing about the occurrence of the others.
• How many equations do you need to check to verify independence of 4 events ?
Ex: Consider two independent tosses of a fair coin. A = {First toss is H}, B = {Second toss is
H}, C = {Tosses have same outcomes}. Are these events independent?
• Pairwise independence does not imply independence! (Checking P(Ai ∩ Aj ) = P(Ai )P(Aj )
for all i and j is not sufficient for confirming independence!)
• For three events, checking P(A1 ∩A2 ∩A3 ) = P(A1 )P(A2 )P(A3 ) is not enough for confirming
independence!
• A1 independent of A2 ∪ A3
Ex: (Network Connectivity) A computer network connects nodes A and B via intermediate nodes C, D, E. For every pair of directly connected nodes (e.g. nodes i and j), there is a probability
pi,j that the link from i to j is up/working. Assuming link failures are independent of each other,
what is the probability that there is a path from A to B in which all links are up ?
1.5.4 Independent Trials and Binomial Probabilities
Bernoulli trials: Independent trials where each identical stage has only two possible outcomes
(e.g. 5 tosses of a coin.) The two possible outcomes can be anything (e.g. Yes or No, Male or
female,..) but we will often think in terms of coin tosses and refer to the two results as H and T.
Consider the Bernoulli trials where we have n tosses of a biased coin with P(H) = p. For
this experiment, we have the following important result :
To obtain this result, let us consider the case where n = 3 and k = 2, and visualize the Bernoulli
trials by means of a sequential description :
Let us now generalize to the case with n tosses :
Ex: A fair die is rolled four times independently. Find the probability that ’3’ will show up only
twice.
(c) Given that a 100 is sent, what is the probability that 110 is received?
(d) In an effort to improve reliability, each symbol is repeated 3 times and the received string is
decoded by majority rule. What is the probability that a transmitted 1 is correctly decoded?
(e) Suppose that repetition encoding is used. What is the probability that 0 is sent given that
101 is received?
Ex: An Internet service provider (ISP) has c modems to serve n customers. At any time, each
customer needs a connection with probability p, independent of others. What is the probability
that there are more customers needing a connection than there are modems ?
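The quantity asked for is a binomial tail probability. A short Python sketch of the computation (ours; the numbers n = 100, p = 0.1, c = 15 are made up for illustration):

from math import comb

def p_overload(n, p, c):
    # P(number of customers needing a connection > c) = sum_{k=c+1}^{n} C(n,k) p^k (1-p)^(n-k)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1, n + 1))

print(p_overload(n=100, p=0.1, c=15))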
1.6 Counting
In random experiments with finite number of possible outcomes that are all equally likely, finding
probabilities reduces to counting the number of outcomes in events and the sample space.
Ω = {s1 , s2 , . . . , sn }
P({sj }) = 1/n, for all j
A = {sj1 , sj2 , . . . , sjk }, with j1 , . . . , jk ∈ {1, 2, . . . , n}
P(A) =
Ex: 6 balls in an urn, Ω = {1, 2, . . . , 6}. Event A = {the number on the ball drawn is divisible by 3}.
Permutations : The number of different ways that one can pick k out of n objects when the order of picking is important: n!/(n − k)!.
Ex: Count the number of 4-letter passwords with distinct letters, where the letters are from the English alphabet and are lowercase only.
Combinations : The number of different ways that one can pick k out of n objects when the order of picking is not important: (n choose k) = n!/(k!(n − k)!).
Partitions : The number of different ways to partition a set with n elements into r disjoint subsets with the ith subset having size ni : (n choose n1 , n2 , . . . , nr ) = n!/(n1 ! n2 ! . . . nr !).
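The three counting formulas can be evaluated directly; a small Python sketch of ours:

from math import factorial, comb

def permutations(n, k):              # ordered picks: n!/(n-k)!
    return factorial(n) // factorial(n - k)

def partitions(n, sizes):            # multinomial coefficient: n!/(n1! n2! ... nr!)
    assert sum(sizes) == n
    out = factorial(n)
    for ni in sizes:
        out //= factorial(ni)
    return out

print(permutations(26, 4))           # ordered 4-letter picks from 26 letters: 358800
print(comb(5, 2))                    # unordered picks: 5 choose 2 = 10
print(partitions(6, [2, 2, 1, 1]))   # orderings of 6 letters containing two repeated pairs: 180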
Ex: How many different words can be formed by the letters in the word RATATU?
Ex: Four balls colored red, green, blue and yellow are distributed randomly to three boxes,
Box1, Box2 and Box3. Each ball can go to any box with equal probability.
1. Find the probability that Box1 contains the red ball and Box2 the remaining balls.
3. Given Box1 contains exactly one ball, what is the probability that exactly two boxes contain
equal number of balls.
Chapter 2
Contents
2.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Probability Mass Functions (PMF) . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.1 How to calculate the PMF of a random variable X . . . . . . . . . . . . . . . . 30
2.2.2 Properties of PMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Some Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.1 The Bernoulli R.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.2 The Binomial R.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.3 The Geometric R.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.4 The Poisson R.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.5 The Uniform R.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Functions of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.1 How to obtain pY (y) given Y = g(X) and pX (x) ? . . . . . . . . . . . . . . . . . 34
2.5 Expectation, Mean, and Variance . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.1 Variance, Moments, and the Expected Value Rule . . . . . . . . . . . . . . . . . 35
2.5.2 Properties of Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5.3 Mean and Variance of Some Common R.V.s . . . . . . . . . . . . . . . . . . . . . 37
2.6 Multiple Random Variables and their Joint PMFs . . . . . . . . . . . . . . 38
2.6.1 Functions of Multiple Random Variables . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.2 More Than Two Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.7 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.7.1 Conditioning a random variable on an event . . . . . . . . . . . . . . . . . . . . . 42
2.7.2 Conditioning one random variable on another . . . . . . . . . . . . . . . . . . . . 43
2.7.3 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.8 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.8.1 Independence of a Random Variable from an Event . . . . . . . . . . . . . . . . . 49
2.8.2 Independence of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.8.3 Variance of the Sum of Independent Random Variables . . . . . . . . . . . . . . 51
2.1 Basics
Many random experiments have outcomes that are numerical. For example, number of people
in a bus, the time a customer will spend in line in a bank etc.
In random experiments with outcomes that are not numerical, it is very useful to map outcomes
to numerical values. For example, in a coin toss experiment : H → 1 and T → 0.
Ex: Two rolls of a 4-faced die. Let X be the maximum of the two rolls.
• The 5-long sequence of heads and tails is not a random variable since it does not have an
explicit numerical value.
Definition 11 The range of a random variable is the set of values it can take.
Ex: Two rolls of a 4-faced die. Let X be the maximum of the two rolls.
Definition 12 A random variable is discrete if its range is a countable (finite or infinite) set.
Note that a function of a random variable is another mapping from the sample space to real
numbers, hence another random variable.
2.2 Probability Mass Functions (PMF)
The most important way to characterize a discrete random variable is through its probability mass function (PMF).
Definition 13 If x is any possible value that a random variable X can take, the probability mass
of x, denoted by pX (x), is the probability of the event {X = x} that consists of all outcomes that
are mapped to x by X:
pX (x) = P({X = x}).
2.2.1 How to calculate the PMF of a random variable X
For each possible value x that X can take:
1. Collect all possible outcomes of the random experiment that give rise to the event {X = x}.
2. Add their probabilities to obtain pX (x).
Ex: Two rolls of a fair 4-faced die. Let X be the maximum of the two rolls. Find pX (k).
Ex: A fair coin is tossed twice. The random variable X is defined as the number of times heads
shows up. Define events {X = 0}, {X = 1}, {X = 2} and find PMF pX (k).
2. P(X ∈ S) = Σ_{x∈S} pX (x)
2.3 Some Discrete Random Variables
2.3.1 The Bernoulli R.V.
pX (k) = { p, if k = 1 ; 1 − p, if k = 0 }
• A simple r.v. that can take only one of two values : 1 and 0.
• Bernoulli r.v. can model random experiments with only two possible outcomes. For ex.,
– a coin toss,
– an experiment with possible outcome of either success or failure.
• Despite its simplicity, Bernoulli r.v. is very important because multiple Bernoulli r.v. can
be combined to model more complex random variables.
• Short-hand notation :
pX (k) = P(X = k) =
• The Binomial r.v. is the number of 1’s in a random experiment that consists of n stages
where each stage is an independent and identical Bernoulli random variable (Ber(p)). E.g.,
– number of H’s in n tosses of a coin,
– number of successes in n trials.
[Figure: PMF pX (k) of the binomial r.v. for n = 7, p = 0.25 (top) and n = 40, p = 0.25 (bottom).]
pX (k) = , k = 0, 1, ...
• The Geometric r.v. is the number of stages until first 1 in a random experiment
that consists of stages where each stage is an independent and identical Bernoulli random
variable (Ber(p)). E.g.
2.3.4 The Poisson R.V.
pX (k) = e^{−λ} λ^k / k! ,   k = 0, 1, 2, . . .
• A Poisson r.v. expresses the probability of how many times an event occurs in a fixed
period of time if these events
• The Poisson r.v. is used to model traffic in many practical problems, such as number of cars
arriving at an intersection, or number of Internet packets arriving at a router, or number
of customers arriving at a supermarket, etc..
Ex: Traffic engineers often use a Poisson r.v. to model the flow of cars in traffic. Suppose that during the hours from 9 to 10 each day, on average 4.7 cars pass through a red light without stopping. During such an hour, what is the probability that k cars pass through that red light without stopping?
2.3.5 The Uniform R.V.
A uniform random variable has a finite range given by a set S and has uniformly distributed probabilities for all values it can take from S:
pX (k) = 1/|S| for all k ∈ S.
2.4 Functions of Random Variables
• An engineering problem : a system with input x and output y related through y = g(x).
Ex: A uniform r.v. X whose range is the integers in [−2, 2]. It is passed through a transformation
Y = |X|. Find PMF of Y .
To obtain pY (y) for any y, add the probabilities of the values x that result in g(x) = y:
pY (y) = Σ_{x: g(x)=y} pX (x).
Ex: A uniform r.v. whose range is the integers in [−3, 3]. It is passed through a transformation
Y = u(X) where u(·) is the discrete unit step function.
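A small Python sketch of the rule above (our addition): it accumulates pX (x) over all x with g(x) = y.

def pmf_of_function(p_x, g):
    # p_x: dict mapping values x to probabilities; g: the function applied to X
    p_y = {}
    for x, p in p_x.items():
        y = g(x)
        p_y[y] = p_y.get(y, 0.0) + p
    return p_y

p_x = {x: 1/5 for x in range(-2, 3)}    # uniform X on the integers -2..2
print(pmf_of_function(p_x, abs))        # PMF of Y = |X|: {2: 0.4, 1: 0.4, 0: 0.2}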
2.5 Expectation, Mean, and Variance
Definition 14 The expected value (also called expectation or mean) of a r.v. X is defined as
E[X] = Σ_x x · P(X = x) = Σ_x x · pX (x).
• The mean of a r.v. X is the weighted average (in proportion to probabilities) of the possible
values of X.
2.5.1 Variance, Moments, and the Expected Value Rule
Definition 15 The variance of a r.v. X provides a measure of spread of its PMF around its
mean and is defined as
var(X) = E[(X − E[X])2 ]
Ex: Consider the length (in cm) of gilt-head bream (çipura, Sparus aurata) sold at market as a
random variable X. Find the variance with the following PMFs.
A typically easier way to calculate var(X) is due to the following rule, called the expected
value rule.
Theorem 4 Let X be a r.v. with PMF pX (x) and g(X) be a function of X. Then,
E[g(X)] = Σ_x g(x) pX (x).
Proof:
The expected value rule can be utilized to evaluate the variance.
var(X) = E[(X − E[X])2 ] = Σ_x (x − E[X])2 pX (x).
• The variance of Y = aX + b
• Unless g(X) is linear, E[g(X)] ≠ g(E[X]). (See the example below.)
Ex: Average speed vs average time. On the highway from Gebze to Kartal, one drives at a speed of 120 km/h when there is no traffic jam. The speed drops to 30 km/h in case of a traffic jam. The probability of a traffic jam is 0.2 and the distance from Gebze to Kartal through the highway is 60 km. Find the expected value of the driving time.
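A numerical version of this example (our sketch), which also illustrates the bullet above that E[g(X)] differs from g(E[X]) in general:

p = {120: 0.8, 30: 0.2}                             # speed V in km/h and its probabilities
E_V = sum(v * pv for v, pv in p.items())            # E[V] = 102 km/h
E_T = sum((60 / v) * pv for v, pv in p.items())     # E[T] with T = 60/V hours
print(E_T, 60 / E_V)                                # 0.8 h versus about 0.588 h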
2.6 Multiple Random Variables and their Joint PMFs
More than one random variable can be associated with the same random experiment, in
which case one can consider a mapping from the sample space Ω to the real plane R2 .
Definition 17 For two random variables X and Y that are associated with the same random
experiment, the joint PMF of X and Y is defined as
• More precise notations for P(X = x, Y = y): P(X = x and Y = y), P({X = x}∩{Y = y}),
P({X = x, Y = y}).
• How to calculate joint PMF pX,Y (x, y) ? For each possible pair of values (x, y) that X and
Y can take :
1. Collect all possible outcomes of the random experiment that give rise to the event
{X = x, Y = y}.
2. Add their probabilities to obtain pX,Y (x, y).
Ex: In the above example, assume the coin is fair and find joint PMF pX,Y (x, y).
• The joint PMF pX,Y (x, y) determines the probability of any event involving X and Y :
P((X, Y ) ∈ A) = Σ_{(x,y)∈A} pX,Y (x, y).
Ex: In the above example, what is the probability that both X and Y are less than 3?
The term marginal PMF is used for pX (x) and pY (y) to distinguish them from the joint PMF
pX,Y (x, y).
• Can one find marginal PMFs from the joint one? How?
– pX (x) = Σ_y pX,Y (x, y)
– pY (y) = Σ_x pX,Y (x, y)
Pf. : Note that the event {X = x} is the union of the disjoint sets {X = x, Y = y} as y
ranges over all the different values of Y . Then,
pX (x) = P(X = x) = P( ∪_y {X = x, Y = y} ) = Σ_y P({X = x, Y = y}) = Σ_y pX,Y (x, y).
Similarly, pY (y) = Σ_x pX,Y (x, y).
• The above equations indicate that if the joint PMF is arranged in a table, then one can find
marginal PMFs by adding the table entries along the columns or rows (this method is
called tabular method).
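A sketch of this tabular method in Python (ours; the joint PMF below is a small example of our own choosing):

from collections import defaultdict

def marginals(p_xy):
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():   # sum along the rows and the columns of the table
        p_x[x] += p
        p_y[y] += p
    return dict(p_x), dict(p_y)

p_xy = {(1, 1): 0.1, (1, 2): 0.2, (2, 1): 0.3, (2, 2): 0.4}
print(marginals(p_xy))               # approximately ({1: 0.3, 2: 0.7}, {1: 0.4, 2: 0.6})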
Ex: Two r.v.s X and Y have the joint PMF pX,Y (x, y) given in the 2-D table below. Find
marginal PMFs. Find the probability that X is smaller than Y ?
1 2 3
1 0 1/10 2/10
2 1/10 1/15 1/30
3 2/10 2/10 1/10
• Note that one can find marginal PMFs from the joint PMF, but the reverse is not true in
general.
Ex: Consider the joint PMF of two r.v.s X and Y that share the same range of values {0, 1, 2, 3}:
pX,Y (x, y) = { c, if 1 < x + y ≤ 3 ; 0, otherwise }.
Ex: Two r.v.s X and Y have the joint PMF pX,Y (x, y) given in the 2-D table below. Find PMF
of Z = 2X + Y . Find E[Z].
1 2 3
1 0 1/10 2/10
2 1/10 1/15 1/30
3 2/10 2/10 1/10
2.6.2 More Than Two Random Variables
pX,Y,Z (x, y, z) = P(X = x, Y = y, Z = z)
The expected value rule naturally extends to functions of more than one random variable:
E[g(X, Y )] = Σ_x Σ_y g(x, y) pX,Y (x, y)
E[g(X, Y, Z)] = Σ_x Σ_y Σ_z g(x, y, z) pX,Y,Z (x, y, z)
E[aX + bY + c] =
Ex: Suppose that n people throw their hats in a box and then the hats are randomly distributed
back to the people. What is the expected value of X, the number of people that get their own
hats back ?
2.7 Conditioning
Ex: There are two coins with different probabilities of H’s : 1/2 and 1/4. One of the coins is
selected in the beginning. The number of heads after 4 coin tosses is denoted by X.
2.7.1 Conditioning a random variable on an event
Definition 18 The conditional PMF of the random variable X, conditioned on the event A with P(A) > 0, is defined by
pX|A (x|A) = P(X = x|A) = P({X = x} ∩ A) / P(A).
• Let us show that pX|A (x) is a legitimate PMF. (Expand P(A) using the total probability
theorem)
Ex: Let X be the outcome of one roll of a tetrahedral die, and A be the event that the outcome
is not 1.
Ex: Ali has a total of n chances to pass his motorcycle license test. Suppose each time he takes
the test, his probability of passing is p, irrespective of what happened in the previous attempts.
What is the PMF of the number of attempts, given that he passes?
Ex: Consider an optical communications receiver that uses a photodetector that counts the
number of photons received within a constant time unit. The sender conveys information to
the receiver by transmitting or not transmitting photons. There is shot noise at the receiver,
and consequently even if nothing is transmitted during that time unit, there may be a positive
count of photons. If the sender transmits (which happens with probability 1/2), the number of
photons counted (including the noise) is Poisson with parameter a + n (a > 0, n > 0). If nothing
is transmitted, the number of photons counted by the detector is again Poisson with parameter
n. Given that the detector counted k photons, what is the probability that a signal was sent?
Examine the behavior of this probability with a, n and k.
2.7.2 Conditioning one random variable on another
• The knowledge that Y equals a particular value y may provide partial knowledge about X.
pX|Y (x|y) = pX,Y (x, y) / pY (y)
Pf.
• Note that using the joint PMF, one can obtain the marginal and the conditional PMFs.
• Let us fix some value y (with pY (y) > 0) and consider conditional PMF pX|Y (x|y) as a
function of only x :
– Then the conditional PMF pX|Y (x|y) assigns non-negative values (i.e. probabilities) to each possible x, and these values add up to 1:
Σ_x pX|Y (x|y) = 1
– Furthermore, the conditional PMF pX|Y (x|y) has the same shape as the joint PMF pX,Y (x, y) except that it is divided by pY (y), which performs normalization.
Ex: The joint PMF of two r.v.s X and Y that share the same range of values {0, 1, 2, 3} is given
by
pX,Y (x, y) = { 1/7, if 1 < x + y ≤ 3 ; 0, otherwise }.
One can obtain the following sequential expressions directly from the definition:
pX (x) = Σ_y pX,Y (x, y) = Σ_y pY|X (y|x) pX (x)
Conditional PMFs involving more than two random variables are defined similarly :
Ex: A die is tossed and the number on the face is denoted by X. A fair coin is tossed X times
and the total number of heads is recorded as Y .
Ex: (From textbook) Prof. Right answers each student question incorrectly with probability 1/4, independent of other questions. In each lecture Prof. Right is asked 0, 1, or 2 questions with equal probability of 1/3. Let X and Y be the number of questions Prof. Right is asked and the number of questions she answers incorrectly in a given lecture, respectively. Find the joint PMF pX,Y (x, y).
Ex: A transmitter is sending messages over a computer network. Let X be the travel time of a given message and Y be the length of a given message. The length of the message Y can take two possible values: y = 10^2 bytes with probability 5/6 and y = 10^4 bytes with probability 1/6. Travel time X of the message depends on its length Y and the congestion in the network at the time of transmission. In particular, the travel time X is 10^−4 Y seconds with probability 1/2, 10^−3 Y seconds with probability 1/3, and 10^−2 Y seconds with probability 1/6. Find the PMF of travel time X.
2.7.3 Conditional Expectation
Conditional Expectations are defined similar to ordinary expectations except that conditional
PMFs are used :
E[X|A] = Σ_x x · pX|A (x)
E[X|Y = y] = Σ_x x · pX|Y (x|y)
Let us recall the Total Probability Theorem in the context of random variables.
• For Ai ’s forming a partition of the sample space, we have
pX (x) = P({X = x}) = Σ_i P({X = x}|Ai ) P(Ai )
       = Σ_i
• Similarly, the sets {Y = y} as y goes over the entire range of Y form a partition of the sample space, and we have
pX (x) = P({X = x}) = Σ_y P({X = x}|{Y = y}) P({Y = y})
       = Σ_y
       =
       =
E[X] = Σ_x x pX (x) = Σ_x x Σ_y pX|Y (x|y) pY (y)
     =
     =
The equalities obtained above are collectively called the Total Expectation Theorem:
E[X] = Σ_i E[X|Ai ] P(Ai )
E[X] = Σ_y E[X|Y = y] pY (y)
Ex: Data flows entering a router are low rate with probability 0.7, and high rate with probability
0.3. Low rate sessions have a mean rate of 10 kbps, and high rate ones have a rate of 200 kbps.
What is the mean rate of flow entering the router?
Ex: Find the mean and variance of the Geometric random variable (with parameter p) using the Total Expectation Theorem. (Hint: condition on the events {X = 1} and {X > 1}.)
Ex: Consider two rolls of a fair die. Let X be the total number of 6’s, and Y be the total number
of 1’s. Find E[X|Y = y] and E[X].
Reading assignment: Example 2.18: The two envelopes paradox, and Problem 2.34: The spider
and the fly problem.
2.8 Independence
The results developed here will be based on the independence of events we covered before : Two
events A and B are independent if and only if P(A ∩ B) = P(A)P(B).
For the independence of a rv X and an event A, the events {X = x} and A should be independent
for all x values.
Ex: Consider two tosses of a coin. Let X be the number of heads and let A be the event that
the number of heads is even.
• If X and Y are independent, then E[XY ] = E[X]E[Y ]. (Reverse may not be true.)
• If X and Y are independent, then E[g(X)h(Y )] = E[g(X)]E[h(Y )].
• The independence definition given above can be extended to multiple random variables in
a straightforward way. For example, three random variables X, Y, Z we have
Ex: Joint PMF of X and Y are given in the table. Are X and Y independent ?
1 2 3
1 0 1/10 2/10
2 1/10 1/15 1/30
3 2/10 2/10 1/10
2.8.3 Variance of the Sum of Independent Random Variables
For two independent random variables X and Y , consider their sum Z = X + Y . Let’s find the expectation and variance of Z.
• E[Z] = E[X + Y ] = E[X] + E[Y ] (this is true even if X and Y are not independent!)
• var(Z) = var(X + Y ) =
Pf.
If one repetitively uses the above results, the general formula for the sum of multiple independent
random variables is obtained.
• E[X1 + X2 + ... + Xn ] =
• var(X1 + X2 + ... + Xn ) =
Ex: (Mean and variance of the sample mean) Let X1 , X2 ,..., Xn be independent Bernoulli random
variables with common mean and variance (i.e. they are i.i.d.). Let us define the sample mean
and find its mean and variance.
Ex: (Estimating probabilities by simulation) In many practical situations, the analytical cal-
culation of the probability of some event of interest is very difficult. If so, we may resort to
(computer) simulations where we observe the outcomes of a certain experiment performed many
times independently. Say we are interested in finding P(A) of an event A defined from the
experiment. Define the sample mean and find its mean and variance.
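A simulation sketch along these lines (our addition; the toy experiment, the event, and all names are made up for illustration):

import random

random.seed(1)

def estimate_probability(experiment, event, n_runs=10000):
    # sample mean of Bernoulli indicator variables over independent repetitions
    hits = sum(1 for _ in range(n_runs) if event(experiment()))
    return hits / n_runs

experiment = lambda: (random.randint(1, 4), random.randint(1, 4))   # two rolls of a fair 4-sided die
event = lambda outcome: outcome[0] == outcome[1]                    # A = {the two rolls are equal}
print(estimate_probability(experiment, event))                      # close to the true value 1/4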
Chapter 3
Contents
3.1 Continuous Random Variables and PDFs . . . . . . . . . . . . . . . . . . . . 54
3.1.1 Some Continuous Random Variables and Their PDFs . . . . . . . . . . . . . . . 56
3.1.2 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Cumulative Distribution Functions . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.1 Properties of CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.2 Hybrid Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3 Normal (Gaussian) Random Variables . . . . . . . . . . . . . . . . . . . . . . 63
3.3.1 The Standard Normal R.V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4 Multiple Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . 66
3.4.1 Joint CDFs, Mean, More than Two R.V.s . . . . . . . . . . . . . . . . . . . . . . 68
3.5 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.5.1 Conditioning One R.V. on Another . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.2 Conditional CDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.5.3 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.6 Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7 The Continuous Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.7.1 Inference about a discrete random variable . . . . . . . . . . . . . . . . . . . . . 77
3.7.2 Inference based on discrete observations . . . . . . . . . . . . . . . . . . . . . . . 78
3.8 Some Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
In some random experiments, one can define random variables that can take on a continuous
range of possible values. For example, your weight is a continuous random variable. (If
you round your weight to an integer, then it becomes a discrete random variable.) The use of
continuous models may result in insights not possible with discrete modeling.
All of the concepts and tools introduced for discrete random variables, such as PMFs, expectation
and conditioning, have continuous counterparts and will be discussed in this chapter.
3.1 Continuous Random Variables and PDFs
A continuous random variable X is described by a nonnegative function fX (x), called the probability density function (PDF) of X, such that
P(X ∈ B) = ∫_B fX (x) dx
for every subset B of the real line. In particular, the probability that X is in an interval is
P(a ≤ X ≤ b) = ∫_a^b fX (x) dx.
Properties of the PDF:
1. Nonnegativity : fX (x) ≥ 0 for all x
2. Normalization :
To interpret the intuition behind the PDF, consider a small interval [x, x + δ] where δ ≪ 1:
• P(x < X ≤ x + δ) = ∫_x^{x+δ} fX (a) da ≈
• fX (x) can be viewed as the "probability mass per unit length near x" (or density of probability mass near x)
Ex: (PDF can be larger than 1) The PDF of a random variable X is given below. Find c and P(|X|² ≤ 0.5).
fX (x) = { cx² , 0 ≤ x ≤ 1 ; 0 , o.w. }
Ex: (A PDF can take arbitrarily large values) Consider a r.v. with
fX (x) = { c x^{−1/2} , 0 ≤ x ≤ 2 ; 0 , o.w. }
3.1.1 Some Continuous Random Variables and Their PDFs
If the probability density is uniform (i.e. constant) over the set of values that the rv takes, we
have a continuous uniform rv.
fX (x) = { 1/(b − a) , a < x < b ; 0 , o.w. }
Normal (Gaussian) R.V.
fX (x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
[Figure: plot of the normal (Gaussian) PDF.]
Exponential R.V.
[Figure: plot of the exponential PDF fX (x).]
3.1.2 Expectation
By replacing the summation with an integral, the following definition of the expected value (or mean or expectation) of a continuous r.v. is obtained:
E[X] = ∫_{−∞}^{∞} x fX (x) dx
Many properties are directly carried over from the discrete counterpart.
– Y = g1 (X) = 2X
– Y = g2 (X) = u(X)
• var(X) =
• If Y = aX + b
– E[Y ] =
– var(Y ) =
3.2 Cumulative Distribution Functions
A discrete r.v. is characterized by its PMF, whereas a continuous one by its PDF. Cumulative distribution functions (CDF) are defined to characterize all types of r.v.s.
• The CDF FX (x) is the accumulated probability "up to (and including)" the value x.
For a discrete r.v., the CDF FX (x) is
• piecewise constant,
• continuous from the right but not from the left at the jump points.
Hence, for continuous rvs, CDFs FX (x) are
• continuous, i.e. from the right and the left (i.e. no jumps in the CDF).
(a) • 0 ≤ FX (x) ≤ 1
• FX (−∞) = limx→−∞ FX (x) =
• FX (∞) = limx→∞ FX (x) =
(h) • If X is discrete and takes only integer values,
• If X is continuous,
FX (x) = ∫_{−∞}^{x} fX (t) dt,    fX (x) = dFX (x)/dx.
(The second equality is valid for values of x where FX (x) is differentiable.)
Note that sometimes, to calculate the PMF of a discrete rv or the PDF of a continuous rv, it is more convenient to calculate the CDF first, and then obtain the PMF or PDF.
Ex: (Maximum of several r.v.s) Let X1 , X2 and X3 be independent rvs and X = max{X1 , X2 , X3 }.
Xi are discrete uniform taking values in {1, 2, 3, 4, 5}. Find pX (k).
3.2.2 Hybrid Random Variables
Random variables that are neither continuous nor discrete are called hybrid or mixed random
variables.
Ex: Assume that when you go to a bus station, there is a bus waiting with probability 1/3, and you have to wait for a bus with probability 2/3. If there is no bus at the stop, the next bus arrives at a time that is uniformly distributed over the next 10 minutes. Let X be the waiting time for the bus. Find FX (x), P(X ≤ 5), fX (x) and E[X].
3.3 Normal (Gaussian) Random Variables
Definition 23 A random variable X is said to be normal or Gaussian if its PDF is of the form
fX (x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} ,
where µ and σ are scalar parameters characterizing the PDF and σ ≥ 0.
• One can show that ∫_{−∞}^{∞} (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx = 1. (See Problem 14 of Section 3 in the textbook.)
• A Gaussian r.v. has several special properties. One of these properties is as follows.
(Other properties are discussed in more advanced probability or random process courses)
3.3.1 The Standard Normal R.V.
Definition 24 A Gaussian (normal) rv with zero mean and unit variance is called a standard normal.
• This CDF has a special notation: Φ(x) = FN (x).
• Φ(·) cannot be evaluated in closed form; however, it can be calculated numerically (i.e. using numerical software). We will use a table to find Φ(x) for several x values:
• The importance of Φ(x) comes from the fact that it is used to find probabilities (or
CDF) of any normal rv Y with arbitrary mean µ and variance σ 2 :
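One concrete way to evaluate Φ numerically (a sketch of ours, using the standard identity Φ(x) = (1 + erf(x/√2))/2), together with the standardization of a general normal r.v.; the numbers in the last line are made up:

from math import erf, sqrt

def phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_cdf(y, mu, sigma):
    # P(Y <= y) for Y normal with mean mu and standard deviation sigma, via standardization
    return phi((y - mu) / sigma)

print(phi(0.0), phi(1.0))                 # 0.5 and about 0.8413, as in the Phi table
print(normal_cdf(181, mu=175, sigma=3))   # about 0.977 for the assumed mu and sigma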
Ex: The average height of men in Turkey is 175cm where it is believed that the height has a
normal distribution. Find the probability that the next baby boy to be born has a height more
than 200cm if the variance is 10cm. (All numbers are made up. Assume that new generations
are not growing taller.)
Ex: Signal Detection (Ex 3.7 from the textbook.) A binary message is transmitted as a signal S,
which is either +1 or −1. The communication channel corrupts the transmission with additive
Gaussian noise with mean µ = 0 and variance σ 2 . The receiver concludes that the signal +1 (or
−1) was transmitted if the received value is not negative (or negative, respectively). What is the
probability of error?
3.4 Multiple Continuous Random Variables
The notion of PDF is now extended to multiple continuous random variables. Notions of joint,
marginal and conditional PDF’s are discussed. Their intuitive interpretation and properties are
parallel to the discrete case of Chapter 2.
Definition 25 Two random variables associated with the same sample space are said to be jointly
continuous if there is a joint probability density function fX,Y (x, y) such that for any subset
B of the two-dimensional real plane,
P((X, Y ) ∈ B) = ∫∫_{(x,y)∈B} fX,Y (x, y) dx dy.
• When B is a rectangle:
• The joint PDF at a point can be approximately interpreted as the ”probability per unit
area” (or density of probability mass) near the vicinity of that point:
• Just like the joint PMF, the joint PDF contains all possible information about the individ-
ual random variables in consideration (i.e. marginal PDFs), and their dependencies (i.e.
conditional PDFs).
– As a special case, the probability of an event associated with only one of the rvs:
P(X ∈ A) = P(X ∈ A, Y ∈ (−∞, ∞)) = ∫_A ∫_{−∞}^{∞} fX,Y (x, y) dy dx.
– Since P(X ∈ A) = ∫_A fX (x) dx, the marginals are evaluated as follows:
fX (x) =
fY (y) =
Ex: Suppose that a steel manufacturer is concerned about the total weight of orders s/he received
during the months of January and February. Let X and Y be the weight of items ordered in
January and February, respectively. The joint probability density is given as
fX,Y (x, y) = { c , if 5000 < x ≤ 10000 and 4000 < y ≤ 9000 ; 0 , o.w. }
Ex: Random variables X and Y are jointly uniform in the shaded area S. Find the constant
c and the marginal PDFs.
Ex: Random variables X and Y are jointly uniform in the shaded area S. Find the constant c
and the marginal PDFs.
3.4.1 Joint CDFs, Mean, More than Two R.V.s
Joint CDF defined for random variables associated with the same random experiment :
• Advantage of joint CDF is again that it applies equally well to discrete, continuous and
hybrid random variables.
– fX,Y (x, y) =
– pX,Y (x, y) =
Ex: X and Y are jointly uniform in the shaded area. Find the joint CDF FX,Y (x, y).
For a function Z = g(X, Y ) of two continuous rvs X and Y :
• As before, E[g(X, Y )] =
More than two continuous rvs can be handled in a similar fashion as two rvs:
P((X, Y, Z) ∈ B) = ∫∫∫_{(x,y,z)∈B} fX,Y,Z (x, y, z) dx dy dz
fX,Y (x, y) =
fX (x) =
E[g(X, Y, Z)] =
E[a1 X1 + a2 X2 + . . . + ak Xk ] =
3.5 Conditioning
Definition 26 The conditional PDF of a continuous random variable X, given and event A
with P (A) > 0, is defined as a nonnegative function fX|A that satisfies
P(X ∈ B|A) = ∫_B fX|A (x) dx,
• In the special case where we condition on an event of the form {X ∈ A} with P (X ∈ A) > 0:
P(X ∈ B|X ∈ A) = P(X ∈ B, X ∈ A) / P(X ∈ A) =
Hence,
fX|{X∈A} (x) = { fX (x) / P({X ∈ A}) , if x ∈ A ; 0, otherwise. }
– As in discrete case, conditional PDF is zero outside the conditioning set.
– Within conditioning set, conditional PDF has same shape as fX (x)
Ex: (The exponential random variable is memoryless.) Let T be the lifetime of a lightbulb,
exponential with parameter λ. Given that you check the bulb at time t and it was fine, find the
conditional CDF of the remaining lifetime X of the bulb.
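For reference (our addition; the derivation itself is done on the board), the key step uses P(T > t) = e^{−λt} for the exponential r.v.:
P(X > x | T > t) = P(T > t + x | T > t) = P(T > t + x) / P(T > t) = e^{−λ(t+x)} / e^{−λt} = e^{−λx} ,
so the remaining lifetime X is again exponential with parameter λ.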
• fX|C (x) =
Total probability theorem applied on PDFs. Let A1 , A2 , . . . An partition the sample space
with P(Ai ) > 0 for each i.
P(X ≤ x) = Σ_{i=1}^{n} P(Ai ) P(X ≤ x|Ai )
∫_{−∞}^{x} fX (t) dt =
fX (x) =
Ex: (textbook) The metro train arrives at the station near your home every quarter hour starting
at 6:00 a.m. You walk into the station every morning between 7:10 and 7:30, with your start
time being random and uniformly distributed in this interval. What is the PDF of the time that
you have to wait for the first train to arrive?
3.5.1 Conditioning One R.V. on Another
Definition 27 Let X and Y be two rvs with joint PDF fX,Y (x, y). For any y with fY (y) > 0, the conditional PDF of X given that Y = y is defined by
fX|Y (x|y) = fX,Y (x, y) / fY (y).
• It is best to view y as a fixed number and consider the conditional PDF fX|Y (x|y) as a
function of the single variable x only (as we did in Chapter 2)
– Conditional PDF fX|Y (x|y) has the same shape as joint PDF fX,Y (x, y) except that it
is divided by fY (y) which does not depend on x.
∗ Hence, finding fX|Y (x|y) from fX,Y (x, y) is similar to Chapter 2’s procedure : take
a vertical or horizontal slice and normalize (see example below.)
– The normalization property can be shown easily for the conditional PDF fX|Y (x|y):
∫_{−∞}^{∞} fX|Y (x|y) dx = 1
Pf.
Ex: (Same example we considered before to find marginals from fX,Y (x, y))
P(x ≤ X ≤ x + δ1 |y ≤ Y ≤ y + δ2 ) = P(x ≤ X ≤ x + δ1 , y ≤ Y ≤ y + δ2 ) / P(y ≤ Y ≤ y + δ2 )
≈
• fX|Y (x|y) · δ1 provides the probability that X ∈ [x, x + δ1 ] given that Y ∈ [y, y + δ2 ].
• As δ2 → 0, we have
P(x ≤ X ≤ x + δ1 |Y = y) ≈
P(X ∈ A|Y = y) = ∫_A fX|Y (x|y) dx
Ex: A grandmom gives X TL to her grandson, where X is uniformly distributed between 0 and
10. The kid spends Y TL of it in the grocery store. Not having reached the age when he would
spend all his money, the amount of money Y he spends is uniformly distributed between 0 and
X. Find fY (y).
3.5.2 Conditional CDF
FX|A (x) =
Ex: X is uniform in [0, 1]. Let A = {X > 1/2}. Find fX|A (x) and FX|A (x).
Also note that one can define conditional CDF of one rv conditioned on another :
FY |X (y|x) = ∫_{−∞}^{y}
fY |X (y|x) =
3.5.3 Conditional Expectation
Ex: Consider a r.v. U which is uniform in [0, 100]. Find E[U |B] where B = {U > 60}. Compare it to E[U ].
Pfs.
Ex: A coin is tossed 5 times. Knowing that the probability of heads is a r.v. P uniformly
distributed in [0.4, 0.7], find the expected value of the number of heads to be observed.
3.6 Independent Random Variables
• Independence of multiple random variables. For example, for three random variables X, Y, Z we have
• If X and Y are independent, we have (note that the reverses are not necessarily true)
– P (X ∈ A, Y ∈ B) =
– FX,Y (x, y) =
– E[XY ] =
– E[g(X)h(Y )] =
– var(X + Y ) =
Ex: Let X and Y be independent Normal random variables with means µX and µY and variances σX² and σY² . Write the expression for their joint PDF and plot the contours on which it is constant.
3.7 The Continuous Bayes’ Rule
For continuous rvs X and Y :
fX|Y (x|y) = fX,Y (x, y) / fY (y) =
Ex: A light bulb has an exponentially distributed lifetime Y . The parameter λ of Y is a uniform random variable in [1, 3/2]. You test a light bulb and record its lifetime y. What can you say about λ ?
3.7.1 Inference about a discrete random variable
P(A|Y = y) ≈
P(N = n|Y = y) =
3.7.2 Inference based on discrete observations
The reverse of the previous case. The unobserved phenomenon is described by continuous rv
Y and the observed phenomenon by event A :
fY |A (y) = fY (y) P(A|Y = y) / P(A) =
(This expression can be obtained by rearranging the previous expression for P(A|Y = y).)
If A is defined through a discrete rv as in A = {N = n} :
fY |N (y|n) = fY (y) pN |Y (n|y) / pN (n) =
Ex: In a coin tossing experiment, the probability of Heads is not deterministically known, but
modeled as a random variable P uniform in [0.4, 0.6]. The coin is tossed twice and
(a) only one H is observed. Call this observation event A and find fP |A (p).
(b) two H’s are observed. Call this observation event B and find fP |B (p).
Ex: (The sum of a random number of r.v.s) You visit a random number N of stores and in the
ith store, spend a random amount of money Xi . The total amount of money you spend is
T = X1 + X2 + . . . + XN
where N is a positive integer r.v. with mean E[N ] and variance var(N ). The r.v.s Xi are inde-
pendent and identically distributed (i.i.d.), and have mean E[X] and variance var(X). Evaluate
the mean and variance of T .
Ex: The treasury of an underdeveloped country produces coins whose probability of heads is a
r.v. P with PDF
fP (p) = { p e^p , if p ∈ [0, 1] ; 0, o.w. }
(b) Given that a coin toss resulted in heads, find the conditional PDF of P .
(c) Given that the first coin toss resulted in heads, find the conditional probability of heads on
the next toss
Chapter 4
Contents
4.1 Derived Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1.1 Finding fY (y) directly from fX (x) . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1.2 Functions of Two Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 Conditional Expectations Revisited (Iterated Expectations) . . . . . . . . 90
4.4 Transforms (Moment Generating Functions) . . . . . . . . . . . . . . . . . . 91
4.4.1 From Transforms to Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.2 Inversion of Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.3 Sums of Independent R.V.s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.4 Transforms Associated with Joint Distributions . . . . . . . . . . . . . . . . . . . 94
4.5 Sum of a Random Number of Independent R.V.s . . . . . . . . . . . . . . . 94
4.1 Derived Distributions
Remember that in Chapter 2 we discussed how to find the PMF of a discrete rv Y = g(X) from
the PMF of discrete rv X and the function g(.) :
This topic covers the same problem for continuous rvs. In particular, we discuss how to find the
distribution of Y = g(X) from the PDF of continuous rv X and the function g(.) :
There is a general two-step procedure for deriving the distribution (i.e. PDF) of Y (hence the name derived distribution): first compute the CDF FY (y), then differentiate it:
fY (y) = dFY (y)/dy
Ex: Let X be uniform on [0, 1] and Y = √X. Derive the distribution of Y , i.e. find fY (y).
Ex: Find the distribution of g(X) = 180/X when X ∼ U [30, 60].
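A Monte Carlo sanity check of the first example (our sketch): under X ∼ U[0,1] and Y = √X, the two-step procedure gives FY (y) = P(√X ≤ y) = P(X ≤ y²) = y² on [0, 1], and the empirical CDF of simulated samples should match it.

import random
from math import sqrt

random.seed(2)
samples = [sqrt(random.random()) for _ in range(100000)]    # draws of Y = sqrt(X)
y = 0.6
empirical_cdf = sum(s <= y for s in samples) / len(samples)
print(empirical_cdf, y**2)                                  # both close to 0.36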
4.1.1 Finding fY (y) directly from fX (x)
When function g(.) has special properties, some formulas can be obtained for fY (y).
Linear case: Y = aX + b =⇒ fY (y) = (1/|a|) fX ((y − b)/a)
Ex: Linear function of a normal rv X with mean µX and variance σX² .
Note: The formula above can be used directly when the function g(x) is monotonic on the support set of fX (x), even if g(x) is not monotonic everywhere.
Monotonic in pieces: Y = g(X) where g(·) is a general function such that, for a value y0 , y0 = g(x) has roots x1 , x2 , . . . , xn
=⇒ fY (y0 ) = Σ_{i=1}^{n} fX (xi ) / |(dg/dx)(xi )|
Ex: (exercise) Consider random variables X and Y which are related as follows :
Y = g(X) = { 1, −2 ≤ X < −1 ; |X|, −1 ≤ X < +1 ; (X − 2)² , +1 ≤ X ≤ +2 ; 0, otherwise }
a) If X is uniformly distributed in [0, 1], determine the PDF fY (y) explicitly and plot it.
b) If X is uniformly distributed in [−1, 2], determine the PDF fY (y) explicitly and plot it.
4.1.2 Functions of Two Random Variables
Let Z = g(X, Y ) be a function of two random variables. fZ (z) can be determined using the same two-step procedure:
fZ (z) = dFZ (z)/dz
Ex: Let X and Y be two independent continuous random variables and Z = X + Y . Show
that, the PDF of Z is given by the ”convolution” of the PDFs fX (x) and fY (y).
Ex: Let X and Y be two independent discrete random variables and Z = X + Y . Show that,
the PMF of Z is given by the ”convolution” of the PMFs pX (x) and pY (y).
Ex: Let X and Y be both exponentially distributed with parameter λ and independent. Let
Z = min{X, Y }. Find the PDF of Z.
Ex: Let X and Y be independent uniform r.v.s in [−1, 1]. Compute the PDF of Z = (X + Y )2 .
Ex: Let X1 , X2 , X3 , . . ., be a sequence of IID (independent, identically distributed) random
variables, whose distribution is uniform in [0, 1]. Using convolution, compute and sketch the
PDF of X1 + X2 . As an exercise, also compute and sketch the PDF of X1 + X2 + . . . + Xn for
n = 3, 4, and observe the trend. (As we add more and more random variables, the PDF of the
sum becomes smoother and smoother, and its shape approaches that of a Gaussian PDF; the
figure below shows this trend. It also turns out that the Gaussian PDF is a fixed point for
convolution: convolving two Gaussian PDFs results in another Gaussian PDF.)
[Figure: PDFs of the sums X1 + . . . + Xn for n = 1, 2, . . . , 6, obtained by repeated convolution of Uniform[0, 1] PDFs, illustrating the increasingly Gaussian shape.]
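A numerical sketch (added, not in the original notes) reproducing the trend in the figure above: the PDF of X1 + . . . + Xn for Uniform[0, 1] summands, obtained by repeated convolution on a fine grid, becomes smoother and increasingly bell-shaped as n grows. Assumes numpy is available.

import numpy as np

dx = 0.001
f_unif = np.ones(int(1 / dx))          # PDF of Uniform[0, 1] sampled on a grid of spacing dx
f_sum = f_unif.copy()                  # PDF of X1
for n in range(2, 7):                  # PDFs of X1 + ... + Xn for n = 2, ..., 6
    f_sum = np.convolve(f_sum, f_unif) * dx
    print(n, "integral ~", round(f_sum.sum() * dx, 3),
          "peak ~", round(f_sum.max(), 3))   # integral stays ~1, peak decreases as the PDF spreads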
4.2 Covariance and Correlation
• A positive (negative) covariance indicates that (X − E[X]) and (Y − E[Y ]) tend to have
the same (opposite) sign.
– cov(X,X)=
– cov(X,aY+b)=
– cov(X,Y+Z)=
• Correlatedness implies dependence (this follows from the statement above; the reverse is not
necessarily true) :
X and Y are correlated ⇐⇒ E[XY ] ≠ E[X]E[Y ] =⇒ X and Y are dependent
– Summary of implications :
Ex: Discrete rvs X and Y uniformly distributed on the 4 points shown below.
• var(X1 + X2 + ... + Xn ) = var(X1 ) + var(X2 ) + ... + var(Xn ) + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} cov(Xi , Xj )
Pf.
ρ(X, Y ) = cov(X, Y ) / √(var(X) var(Y ))
• −1 ≤ ρ(X, Y ) ≤ 1 (pf. based on the Schwarz inequality: (E[AB])² ≤ E[A² ]E[B² ])
• ρ(X, Y ) = +1 ⇐⇒ Y = aX + b, a > 0
• ρ(X, Y ) = −1 ⇐⇒ Y = aX + b, a < 0
Ex: Consider n independent coin tosses, each with probability of heads equal to p. Let X be the
number of heads and Y the number of tails. Find the correlation coefficient of X and Y .
(b) Deviating from the Bernoulli process, let us assume that there is correlation between the Xi ’s:
cov(Xi , Xj ) = c^{|i−j|} σX² , with 1 > c > 0. Does the variance of the sum change?
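A small sketch (an addition, with hypothetical values) evaluating the variance-of-a-sum formula above, first for uncorrelated Xi and then under the assumed correlation cov(Xi , Xj ) = c^{|i−j|} σX² .

n, sigma2, c = 20, 1.0, 0.5            # hypothetical values of n, sigma_X^2 and c
var_uncorr = n * sigma2                # variance of the sum when the Xi are uncorrelated

cov_terms = sum(c ** abs(i - j) * sigma2
                for i in range(n) for j in range(i + 1, n))
var_corr = n * sigma2 + 2 * cov_terms  # variance of the sum with the assumed correlation

print(var_uncorr, round(var_corr, 3))  # positive correlation inflates the variance of the sum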
4.3 Conditional Expectations Revisited (Iterated Expectations)
Law of iterated expectations: E[X] = E[ E[X|Y ] ].
Ex: Break a stick uniformly at random. Keep the piece that contains the left end. Now, break
this piece and let X be the length of the piece that contains the left end. Find E[X].
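A Monte Carlo sketch (an addition; a stick of length 1 is assumed) illustrating the law of iterated expectations on this example: if Y is the length of the first kept piece, then E[X|Y ] = Y /2, so the direct estimate of E[X] and the estimate of E[ E[X|Y ] ] should agree. Assumes numpy is available.

import numpy as np

rng = np.random.default_rng(1)
y = rng.uniform(0.0, 1.0, size=500_000)   # length of the first kept (left) piece
x = rng.uniform(0.0, y)                   # second break is uniform on [0, Y]

print(round(np.mean(x), 4))               # direct Monte Carlo estimate of E[X]
print(round(np.mean(y / 2.0), 4))         # estimate of E[ E[X|Y] ] = E[Y/2]; compare with the
                                          # value derived on the board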
Note that the same idea can be extended to expected value of a function of X :
E[g(X)] = E[E[g(X)|Y ]]
Ex: Same stick breaking example. Find var(X).
4.4 Transforms (Moment Generating Functions)
The transform (or moment generating function, MGF) associated with a rv X is defined as
MX (s) = E[e^{sX} ].
Ex: MGF of a standard normal rv and a normal rv with mean µ and variance σ² .
4.4.1 From Transforms to Moments
The name moment generating function follows from the following property.
E[X^n ] = (d^n /ds^n ) MX (s) |_{s=0}
Proof:
Ex: For the discrete rv X in the previous question, find the first and second moments.
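A symbolic sketch (an addition to the notes) of the moment-generating property, using an exponential rv as an illustrative choice (not the rv of the previous question): differentiating its MGF MX (s) = λ/(λ − s) at s = 0 should give E[X^n ] = n!/λ^n . Assumes sympy is available.

import sympy as sp

s, lam = sp.symbols('s lam', positive=True)
M = lam / (lam - s)                       # MGF of an Exponential(lam) rv

for n in (1, 2, 3):
    moment = sp.diff(M, s, n).subs(s, 0)  # n-th derivative of the MGF at s = 0
    print(n, sp.simplify(moment))         # prints 1/lam, 2/lam**2, 6/lam**3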
4.4.2 Inversion of Transforms
Inversion Property: The transform MX (s) associated with a r.v. X uniquely determines the
CDF of X, assuming that MX (s) is finite for all s in some interval [−a, a], where a is a positive
number.
• Transform MX (s) of X uniquely determines the CDF (and therefore PDF or PMF) of X.
– There are formulas for inverting transforms, but they are usually difficult to use.
– In practice, transforms are inverted by pattern matching based on tables (see the sketch below).
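A small sketch (an addition, with a hypothetical PMF that is not the one in the examples below) of why pattern matching works for discrete rvs: the transform of a PMF is a weighted sum of exponentials, MX (s) = Σk pk e^{s xk} , so each term p e^{sx} in a given transform identifies a probability mass p at the point x. Assumes numpy is available.

import numpy as np

xs = np.array([-1.0, 2.0, 5.0])        # hypothetical support points
ps = np.array([0.2, 0.5, 0.3])         # hypothetical probability masses

s_grid = np.linspace(-0.5, 0.5, 5)
M_formula = np.array([(ps * np.exp(s * xs)).sum() for s in s_grid])     # sum_k p_k e^{s x_k}

rng = np.random.default_rng(2)
samples = rng.choice(xs, p=ps, size=400_000)
M_empirical = np.array([np.mean(np.exp(s * samples)) for s in s_grid])  # E[e^{sX}] from samples

print(np.round(M_formula, 3))
print(np.round(M_empirical, 3))        # close to the formula values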
Ex: The transform associated with a r.v. X is given as MX (s) = (1/3)e^{−2s} + 1/2 + (1/6)e^{3s} . What is the corresponding distribution?
Ex: Find the distribution of X for MX (s) = p e^{s} / (1 − (1 − p)e^{s} ).
Ex: Find the distribution for MX (s) = (16 − 4s + 8 − (4/3)s) / ((6 − s)(4 − s)).
4.4.3 Sums of Independent R.V.s
If Z = X + Y with X and Y independent, then MZ (s) = E[e^{s(X+Y )} ] = MX (s)MY (s).
Similarly, if Z = X1 + X2 + ... + Xn where the Xi are independent, then the MGF of Z is
Linear function of independent Gaussian rvs: A linear function of independent Gaussian rvs
is also Gaussian.
4.4.4 Transforms Associated with Joint Distributions
Consider rvs X1 , X2 , . . . , Xn with a joint PDF. Their MGF (or multivariate transform) is then
defined by
MX1 ,X2 ,...,Xn (s1 , s2 , . . . , sn ) = E[e^{s1 X1 + s2 X2 + ... + sn Xn} ].
4.5 Sum of a Random Number of Independent R.V.s
Y = X1 + X2 + . . . + XN , where
N : a random variable taking nonnegative integer values,
X1 , X2 , . . . are iid,
N, X1 , X2 , . . . are independent.
Show the following equalities, which relate the mean, variance and the transform of Y to similar
quantities of Xi and N :
1. E[Y ] = E[X]E[N ]
2. var(Y ) = E[N ] var(X) + (E[X])² var(N )
3. MY (s) = MN (log(MX (s)))
• E[Y ] = E[X]E[N ]
• MY (s) = MN (log(MX (s)))
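A simulation sketch (an addition, using the hypothetical choices N ∼ Poisson(4) and Xi ∼ Uniform[0, 2], which are not the ones in the example below) checking the first two equalities for Y = X1 + . . . + XN . Assumes numpy is available.

import numpy as np

rng = np.random.default_rng(3)
trials = 100_000
N = rng.poisson(4.0, size=trials)                               # random number of terms
Y = np.array([rng.uniform(0.0, 2.0, size=n).sum() for n in N])  # random sum for each trial

EX, varX = 1.0, 4.0 / 12.0              # mean and variance of Uniform[0, 2]
EN, varN = 4.0, 4.0                     # mean and variance of Poisson(4)

print(round(Y.mean(), 3), EN * EX)                        # both ~ 4.0
print(round(Y.var(), 3), EN * varX + EX**2 * varN)        # both ~ 5.333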
Ex: Behzat visits a number of bookstores, looking for The Loneliness of the Long Distance
Runner. Any given bookstore carries the book with probability p, independent of the others. In
a typical bookstore visited, Behzat spends a random amount of time, exponentially distributed
with parameter λ, until he finds the book or determines that the bookstore doesn’t carry it. We
assume that Behzat will keep visiting bookstores until he buys the book and that the time spent
in each is independent of everything else. We wish to find the mean, variance, and PDF of the
total time spent in this quest.
Chapter 5
Limit Theorems
Contents
5.1 Markov and Chebychev Inequalities . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 The Weak Law of Large Numbers (WLLN) . . . . . . . . . . . . . . . . . . . 99
5.3 Convergence in Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4 The Central Limit Theorem (CLT) . . . . . . . . . . . . . . . . . . . . . . . . 101
5.1 Markov and Chebychev Inequalities
Markov Inequality: If X is a nonnegative random variable, then for any a > 0,
P(X ≥ a) ≤ E[X] / a.
Proof:
Ex: Use Markov inequality to bound the probability that a standard normal random variable
exceeds twice its standard deviation, in absolute value.
Ex: Let X be uniform in [5, 10]. Compute probabilities that X exceeds certain values and
compare them with the bound given by Markov inequality.
To be able to bound probabilities for general random variables (not necessarily positive), and to
get a tighter bound, we can apply the Markov inequality to (X − E[X])² and obtain the following.
Chebychev Inequality: For a random variable X with mean E[X] and variance σ² ,
P(|X − E[X]| ≥ a) ≤ σ² / a² .
Proof:
Note that Chebychev's inequality uses more information about X than the Markov inequality
and thus can provide a tighter bound on the probabilities related to X.
• In addition to the mean (a first-order statistic), Chebychev’s inequality also uses the vari-
ance, which is a second-order statistic.
(You can easily imagine two very different random variables with the same mean: for example,
an exponential r.v. with mean 1 and variance 1, and a discrete random variable that takes on
the values 0.9 and 1.1 with equal probability. The Markov inequality does not distinguish between
these distributions, whereas the Chebychev inequality does.)
Ex: Apply Chebychev inequality instead of Markov in the two examples above.
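A numerical comparison sketch (an addition to the notes) for X uniform in [5, 10]: exact tail probabilities P(X ≥ a) versus the Markov bound E[X]/a and the Chebychev bound σ² /(a − E[X])² applied to the event |X − E[X]| ≥ a − E[X], for a few thresholds a above the mean.

EX, varX = 7.5, 25.0 / 12.0                     # mean and variance of Uniform[5, 10]
for a in (8.0, 9.0, 9.5):
    exact = (10.0 - a) / 5.0                    # exact P(X >= a) for Uniform[5, 10]
    markov = EX / a                             # Markov: only uses E[X] and X >= 0
    cheby = min(varX / (a - EX) ** 2, 1.0)      # Chebychev: also uses the variance
    print(a, round(exact, 3), round(markov, 3), round(cheby, 3))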
5.2 The Weak Law of Large Numbers (WLLN)
WLLN: Let X1 , X2 , . . . be iid random variables with mean µ and variance σ² . The sample mean
Mn = (X1 + X2 + . . . + Xn ) / n
satisfies, for any ε > 0 :
P(|Mn − µ| ≥ ε) = P( |(X1 + X2 + . . . + Xn )/n − µ| ≥ ε ) → 0 as n → ∞.
• The WLLN asserts that the sample mean Mn of a large number of iid rvs is very close to
the true mean E[X] = µ, with high probability.
• If we refer to ε as the accuracy level and δ = σ² /(nε² ) as the confidence level : for any given
levels of accuracy and confidence, the sample mean Mn will be within ε of µ with probability at
least 1 − δ, by just choosing n sufficiently large.
Ex: (Probabilities and frequencies) Consider an event A with probability p = P(A). The empirical
frequency of A in n repetitions is the fraction of the repetitions in which A occurs.
Ex: Polling: We want to estimate the fraction of the population that will vote for XYZ. Let Xi
be equal to 1 if the ith person votes in favor of XYZ, and 0 otherwise. How many people should
we poll, to make sure our error will be less than 0.01 with 95% probability?
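A sketch (an addition, with hypothetical accuracy and confidence targets that differ from those in the example above) of choosing the sample size n from the Chebychev-based bound P(|Mn − p| ≥ ε) ≤ σ² /(nε² ) ≤ 1/(4nε² ), using the worst-case Bernoulli variance σ² = p(1 − p) ≤ 1/4.

import math

eps, delta = 0.05, 0.10        # hypothetical accuracy and confidence levels
n = math.ceil(1.0 / (4.0 * eps**2 * delta))
print(n)                       # 1000 samples suffice under this (rather loose) bound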
5.3 Convergence in Probability
• Intuitively, if limn→∞ an = a, then the difference |an − a| can be made smaller than any desired
accuracy level ε by just choosing n sufficiently large. The probabilistic counterpart is the following:
a sequence of random variables Yn converges in probability to a if, for every ε > 0,
lim_{n→∞} P(|Yn − a| ≥ ε) = 0.
• If Yn converges in probability to a, this means that "almost all" of the probability of the
PDF/PMF of Yn is concentrated within ε of a for large values of n.
5.4 The Central Limit Theorem (CLT)
CLT: Let X1 , X2 , . . . be a sequence of IID random variables with E[Xi ] = µ, var(Xi ) = σ² , and
define
Sn = X1 + X2 + . . . + Xn ,    Zn = (Sn − nµ) / (σ√n).
• E[Sn ] = , var(Sn ) =
• E[Zn ] = , var(Zn ) =
Then, the CDF of Zn converges to the standard normal CDF
Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−x²/2} dx
in the sense that
lim_{n→∞} P(Zn ≤ z) = Φ(z)
for every z.
• CLT suggests that the sum of a large number of IID rvs is approximately normal.
• This is a very common situation where a random effect originates from many small random
factors. In practice, the CLT simplifies the probabilistic analysis of many problems.
• One thing to be careful about while using the CLT is that it is a "central" property, i.e., it works
well around the mean of the distribution, but not necessarily at the tails (consider, for example,
Xi being strictly positive and discrete; then Sn has absolutely no probability mass below zero,
whereas the approximating Gaussian distribution does.)
Simple Proof of the Central Limit Theorem: We will use transforms. For simplicity,
assume E[X] = 0, σ = 1. Consider the transform of Zn , MZn (s) when n → ∞. Recall that
e^x = 1 + x + x² /2! + x³ /3! + . . . . Note that Xi /√n is very close to 0 for large n.
MZn (s) = E[ e^{s(X1 +X2 +...+Xn )/√n} ] = E[ e^{sX1 /√n} e^{sX2 /√n} · · · e^{sXn /√n} ] = ∏_{i=1}^{n} E[ e^{sXi /√n} ]
= ∏_{i=1}^{n} E[ 1 + s Xi /√n + s² Xi² /(2n) + s³ Xi³ /(3! n^{3/2}) + . . . ]
= ∏_{i=1}^{n} ( 1 + s² /(2n) + . . . )    (using E[Xi ] = 0, E[Xi² ] = 1)
lim_{n→∞} MZn (s) = lim_{n→∞} ( 1 + (s² /2)/n )^n = e^{s²/2}
Since the transform MZn (s) converges to that of a standard Gaussian, and there is a one-to-one
mapping between transforms and distributions, we conclude that CDF of Zn converges to that
of a standard Gaussian.
Ex: You download 300 mp3 files from a website each month, where your monthly limit is 1700
MBytes. The mp3 file sizes are uniformly distributed in [3, 8] MBytes and independent of the
other file sizes. What is the probability that you have to pay extra in a month?
Ex: Promoting a new product, a grocery store has an eye-catching stand which makes people
buy the product with probability 3/10. Assuming that the purchasing behaviors of the 100 people
who visited the store are independent, what is the probability that fewer than 40 items are sold?
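A generic sketch (an addition; the numbers below are hypothetical and are not the worked answers of the examples above) of how the CLT-based normal approximation P(X1 + . . . + Xn ≤ c) ≈ Φ((c − nµ)/(σ√n)) is evaluated. Assumes scipy is available.

import math
from scipy.stats import norm

def clt_cdf(c, n, mu, sigma):
    # Normal approximation to P(X1 + ... + Xn <= c) for iid Xi with mean mu and std sigma.
    return norm.cdf((c - n * mu) / (sigma * math.sqrt(n)))

# e.g. 50 iid Uniform[0, 1] terms: mean 0.5, standard deviation sqrt(1/12)
print(round(clt_cdf(28.0, 50, 0.5, math.sqrt(1.0 / 12.0)), 3))   # roughly 0.93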