T4 Probability
Probability
1. All definitions agree on the algebraic and arithmetic procedures that must be
followed; hence, the definition does not influence the outcome.
The frequentist approach is based on the notion of statistical regularity; i.e., in the
long run, over replicates, the cumulative relative frequency of an event (E) stabilizes.
The best way to illustrate this is with an example experiment that we run many
times, measuring the cumulative relative frequency (crf). The crf is simply the
relative frequency computed cumulatively over some number of replicates of
samples, each with a sample space S.
Suppose we have a treatment for high blood pressure. The event, E, we are
interested in is successfully controlling the blood pressure. So, we want to be able to
make a prediction about the probability that a patient treated in the future will have
blood pressure under control, P(E). To estimate this probability we conduct an
experiment that is replicated over time in months. The data are presented in the
table below.
The crf values down the rightmost column fluctuate the most in the beginning, but
rapidly stabilize. Statistical regularity is the stabilization of the crf in the face of
individual fluctuations from month to month in the relative frequency of E.
We can get an idea of this by using an example with “nearly infinite” replications.
[Figure: cumulative relative frequency (y-axis, 0 to 1) plotted against the number of replicates (x-axis, 0 to 10,000); after early fluctuation the crf stabilizes.]
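Statistical regularity is easy to reproduce by simulation. The sketch below is a minimal Python illustration; the success rate of 0.7 and the seed are arbitrary assumptions, not values from the blood-pressure example.

```python
import random

random.seed(1)

p_true = 0.7          # assumed long-run success rate (illustrative only)
n_replicates = 10000  # "nearly infinite" number of replicates
successes = 0

for i in range(1, n_replicates + 1):
    successes += random.random() < p_true   # one Bernoulli trial: event E occurs or not
    if i in (10, 100, 1000, 10000):
        # cumulative relative frequency after i replicates
        print(f"n = {i:>5}  crf = {successes / i:.3f}")
```

The early crf values wander, while the later ones settle close to the underlying rate, which is the stabilization described above.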
For all probability models to give consistent results about the outcomes of future
events they need to obey four simple axioms (Kolmogorov 1933).
Probability axioms:
1. For any event E, 0 ≤ P(E) ≤ 1.
2. The probability of the entire sample space S is 1; i.e., P(S) = 1.
3. When events E and F are disjoint (they cannot occur together),
P(E or F) = P(E) + P(F).
Product rule:
The product rule applies when two events E1 and E2 are independent. E1 and
E2 are independent if the occurrence or non-occurrence of E1 does not change
the probability of E2 [and vice versa]. For independent events,
P(E1 and E2) = P(E1) × P(E2). For example, for two tosses of a fair coin,
P(head on toss 1 and head on toss 2) = 0.5 × 0.5 = 0.25. [A further statistical
definition requires the use of the multiplication theorem.] A small enumeration of
this example is sketched below.
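As a concrete check of axiom 3 and the product rule, one can enumerate the sample space of two independent tosses of a fair coin; a minimal Python sketch:

```python
from itertools import product

# Sample space for two independent tosses of a fair coin; each outcome has probability 1/4.
S = list(product("HT", repeat=2))
p = {outcome: 0.25 for outcome in S}

# Axiom 3 (disjoint events): P(two heads or two tails) = P(HH) + P(TT)
print(p[("H", "H")] + p[("T", "T")])   # 0.5

# Product rule (independence): P(H on toss 1 and H on toss 2) = P(H) * P(H)
print(0.5 * 0.5, p[("H", "H")])        # 0.25 0.25
```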
The binomial model gives the probability of observing k successes in n independent
trials, each with probability of success p:
P = \binom{n}{k} p^{k} (1-p)^{n-k}, \qquad \text{where } \binom{n}{k} = \frac{n!}{k!\,(n-k)!}
If we had a fair coin we could predict the probability of specific outcomes (e.g., 1
head & 1 tail in two tosses) by setting the p parameter equal to 0.5. Note that the
model does not require this. In the case of the coin toss, we are interested in a
conditional probability; i.e., what is the probability of obtaining, say, 5 heads given a
fair coin (p = 0.5) and 12 tosses, or P(k=5 | p=0.5, n=12).
CASE 1: PROBABILITY.
Let’s continue to use the familiar coin tossing experiment to examine this inversion.
P = \binom{n}{k} \left(\tfrac{1}{2}\right)^{k} \left(\tfrac{1}{2}\right)^{n-k}, \qquad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}
The question is the same: “If I toss a fair coin 12 times, what is the probability that I
will obtain 5 heads and 7 tails?”
The answer comes directly from the above formula where n = 12, and k = 5. The
probability of such a future event is 0.193359.
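A minimal check of this arithmetic with Python's standard library:

```python
from math import comb

n, k, p = 12, 5, 0.5
# Binomial probability P(k heads | p, n) = C(n, k) * p^k * (1 - p)^(n - k)
prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(prob)   # 0.193359375
```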
From the probability perspective we can look at the distribution of all possible
outcomes.
This is the distribution of mutually exclusive outcomes that comprise the set of all
possible outcomes under the model where p = 0.5. Remember probability axiom 2,
where P(S) = 1; the probabilities of the outcomes (i.e., 0 to 12 heads) sum to 1.
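Evaluating the same formula for every outcome from 0 to 12 heads illustrates axiom 2; a small sketch:

```python
from math import comb

n, p = 12, 0.5
# Probability of each mutually exclusive outcome: 0, 1, ..., 12 heads
dist = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
print(sum(dist.values()))   # 1.0  (axiom 2: P(S) = 1)
```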
CASE 2: LIKELIHOOD.
The second question is: “What is the probability that my coin is fair if I tossed it 12
times and observed 5 heads and 7 tails?”
We have inverted the problem. In the previous case (1) we were interested in the
probability of a future outcome given that my coin is fair. In this case (2) we are
interested in the probability that my coin is fair, given a particular outcome.
So, in the likelihood framework we have inverted the question such that the
hypothesis (H) is variable, and the outcome (let’s call it the data, D) is constant.
A problem: What we want to measure is P(H|D). The problem is that we can’t work
with the probability of a hypothesis, only the relative frequencies of outcomes. The
solution comes from the knowledge that there is a relationship between P(H|D) and
P(D|H):
P(H|D) = αP(D|H), where α is an unknown constant.
The likelihood of the hypothesis given the data, L(H|D), is proportional to the
probability of the data given the hypothesis, P(D|H). As long as we stick to
comparing hypotheses on the same data and probability model, the constant remains
the same, and we can compare the likelihood scores. We cannot make comparisons
on different data using likelihoods.
Let’s use the binomial model to look at the application of probability as compared
with likelihood.
PROBABILITIES                                    Data
                                    D1: 1H & 1T       D2: 2H
Hypotheses   H1: p(H) = 1/4            0.375            0.0625
             H2: p(H) = 1/2            0.5              0.25
Following the probability axioms, and as we saw in the binomial distribution above,
given a single hypothesis (i.e., H2: p(H) = 0.5), the different outcomes can be
summed. For example, P(D1 or D2|H2) = P(D1|H2) + P(D2|H2), a well-known
result, with all possible outcomes summing to 1. However, we cannot use the
addition axiom over different hypotheses H1 and H2; i.e., P(D1|H1 or D2|H2) ≠
P(D1|H1) + P(D2|H2).
LIKELIHOODS                                      Data
                                    D1: 1H & 1T       D2: 2H
Hypotheses   H1: p(H) = 1/4         α1 × 0.375        α2 × 0.0625
             H2: p(H) = 1/2         α1 × 0.5          α2 × 0.25
Under likelihood we can work with different hypotheses as long as we stick to the
same dataset. Take the likelihoods of H1 and H2 under D1. We can infer that
H1 is ¾ as likely as H2 (0.375/0.5 = 0.75). Note that when working with likelihoods,
we compute the probabilities and drop the constant for convenience. The likelihoods
do not sum to 1 because the probability terms are for the same outcome drawn from
different distributions [probabilities for the total set of outcomes S in the same
distribution sum to 1].
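The cells in both tables come from the same binomial calculation with n = 2 tosses. A small sketch reproducing them, together with the likelihood ratio of H1 versus H2 under D1 (the shared constant α1 cancels):

```python
from math import comb

def binom(k, n, p):
    """P(k heads | p, n) under the binomial model."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two tosses (n = 2); D1 = 1 head & 1 tail (k = 1), D2 = 2 heads (k = 2)
for name, p in [("H1: p(H) = 1/4", 0.25), ("H2: p(H) = 1/2", 0.5)]:
    print(name, binom(1, 2, p), binom(2, 2, p))
# H1: p(H) = 1/4  0.375  0.0625
# H2: p(H) = 1/2  0.5    0.25

# Likelihood ratio of H1 vs H2 on the same data D1 (constant alpha cancels)
print(binom(1, 2, 0.25) / binom(1, 2, 0.5))   # 0.75
```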
Let’s use likelihood to follow through on our question of the probability that the coin
is fair given 12 tosses with 5 heads and 7 tails. As always, our tosses are
independent.
Perhaps there is an alternative hypothesis, i.e., one where p ≠ 0.5, that has a higher
likelihood. To explore this possibility we take the binomial formula as our likelihood
function and evaluate the resulting likelihoods with respect to various values of p and
the given data. The results can be plotted as a curve; this curve is sometimes called
the likelihood surface. The curve for our data (12,5) is shown below.
[Figure: likelihood plotted against values of p from 0 to 1 for the data n = 12, k = 5; the curve peaks at the ML estimate of p = 0.42.]
IMPORTANT NOTE: It looks like a distribution, but don’t be fooled: the area under the
curve does not sum to 1. The curve reflects the probability of the same data under
different values of p (a parameter of the model), and these are not mutually
exclusive outcomes within a single set of all the possible outcomes.
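A grid evaluation of the likelihood function reproduces the curve and its peak; a minimal sketch (the grid step of 0.01 is an arbitrary choice):

```python
from math import comb

n, k = 12, 5   # the observed data: 5 heads in 12 tosses

def likelihood(p):
    """L(p | data), proportional to P(data | p); the constant alpha is dropped."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

grid = [i / 100 for i in range(101)]   # candidate values of p from 0 to 1
ml_p = max(grid, key=likelihood)
print(ml_p)   # 0.42 on this grid, close to the exact MLE of k/n = 5/12 ≈ 0.4167
# The likelihood values across the grid do not sum to 1: different values of p
# are not mutually exclusive outcomes drawn from a single sample space.
```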