Chapter 4
[Figure: the first 10 observations plotted against observation #, with values roughly between 300 and 370. How would you guess µ?]
Since $\mu = E(X) \approx \frac{1}{n}\sum_{i=1}^{n} X_i$ for large $n$, a reasonable guess or estimate for µ is the average of the observed sample.
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is our estimate of $\mu = E(X_i)$.
µ is sometimes called the population mean since
it is the mean of the entire population of
all the potential values, whereas the sample mean is
just the average of some of them.
4.2 The Distribution of the Normal Sample Mean
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Different samples give different values of $\bar{X}$. What are the chances of getting the right value?
© Imperial College Business School
Key idea: before the data are plugged in, the estimator $\bar{X}$ is itself a random variable, so it has its own distribution (its sampling distribution).
OK? Do not worry, I will return to this idea over and over again!
Expected value:

$E(\bar{X}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \frac{1}{n}\,n\mu = \mu$
The estimator is unbiased: on average we get exactly what we are after, namely µ. We expect $\bar{X}$ to give us (on average) the right answer when we plug in the data to estimate µ.
Variance: let $Var(X_i) = \sigma^2$. Then, since the $X_i$ are independent (so the variance of the sum is the sum of the variances),

$Var(\bar{X}) = Var\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = Var\!\left(\tfrac{1}{n}X_1 + \tfrac{1}{n}X_2 + \cdots + \tfrac{1}{n}X_n\right)$

$= \frac{1}{n^2}Var(X_1) + \frac{1}{n^2}Var(X_2) + \cdots + \frac{1}{n^2}Var(X_n) = \frac{1}{n^2}\sigma^2 + \frac{1}{n^2}\sigma^2 + \cdots + \frac{1}{n^2}\sigma^2$

$= \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}$
The variance is inversely related to the number of observations in the
sample. As we increase n we capture µ better because the estimator is
less “dispersed” around its expected value which is µ.
To summarize:
• As an estimator for µ (up front, before plugging in the observations), $\bar{X}$ has nice statistical properties.
• Its expected value is µ (the value we are after). Its variance (its expected ‘dispersion’ around µ) decreases as the number of observations increases.
• In particular, if n were very, very large, then we would be sure to capture µ accurately.
• In other words, the larger the sample, the better!
• Makes sense, right?
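One way to see both properties concretely is a small simulation: draw many samples of size n from a normal distribution with known µ and σ, compute the sample mean of each, and compare the average and variance of those means with µ and σ²/n. The MATLAB sketch below uses made-up values (mu = 5, sigma = 2, n = 25), not the weight data from the notes.

% Sketch: check E(Xbar) = mu and Var(Xbar) = sigma^2/n by simulation.
% mu, sigma and n are illustrative values, not taken from the data.
mu = 5; sigma = 2; n = 25; nrep = 100000;

xbar = zeros(nrep,1);
for r = 1:nrep
    x = mu + sigma*randn(n,1);   % one i.i.d. normal sample of size n
    xbar(r) = mean(x);           % its sample mean
end

fprintf('mean of the sample means: %.4f  (mu = %.4f)\n', mean(xbar), mu);
fprintf('variance of the sample means: %.4f  (sigma^2/n = %.4f)\n', var(xbar), sigma^2/n);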
Then $\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$ (this is the sampling distribution of the estimator).

Let $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
This will simplify the look of the formulae and emphasize
that the sample mean has its own standard deviation (we
derived the variance earlier, right? The standard deviation
is just the square root).
$\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2) \;\Rightarrow\; \frac{\bar{X}-\mu}{\sigma_{\bar{X}}} \sim N(0,1)$

So,

$\Pr\!\left(-2 \le \frac{\bar{X}-\mu}{\sigma_{\bar{X}}} \le 2\right) = .95$

(really the 2 is 1.96!)
So,

$\Pr(-2\sigma_{\bar{X}} \le \bar{X}-\mu \le 2\sigma_{\bar{X}}) = .95$

$\Pr(-\bar{X}-2\sigma_{\bar{X}} \le -\mu \le -\bar{X}+2\sigma_{\bar{X}}) = .95$

$\Pr\!\left(\bar{X}-2\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X}+2\frac{\sigma}{\sqrt{n}}\right) = .95$

For i.i.d. normal data with known standard deviation σ, a 95% confidence interval for the true mean µ is

$\left(\bar{X}-2\sigma_{\bar{X}},\; \bar{X}+2\sigma_{\bar{X}}\right) = \bar{X} \pm 2\sigma_{\bar{X}} = \bar{X} \pm 2\frac{\sigma}{\sqrt{n}}$
Interpretation: over repeated samples, 95% of the intervals constructed this way will contain the true value of µ.
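This interpretation can be checked directly by simulation: build the interval $\bar{X} \pm 2\sigma/\sqrt{n}$ over and over from fresh samples and count how often it contains the true µ. A minimal MATLAB sketch, again with illustrative values for µ, σ and n (not the data used below):

% Sketch: coverage of the known-sigma interval Xbar +/- 2*sigma/sqrt(n).
% mu, sigma and n below are illustrative, not from the weight data.
mu = 10; sigma = 3; n = 50; nrep = 100000;

covered = 0;
for r = 1:nrep
    x    = mu + sigma*randn(n,1);
    xbar = mean(x);
    half = 2*sigma/sqrt(n);          % half-width of the interval
    covered = covered + ((xbar-half <= mu) && (mu <= xbar+half));
end
fprintf('fraction of intervals containing mu: %.3f (about .95)\n', covered/nrep);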
Example:
The sample average with 500 observations was 344.83, and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = .67$.

>> k1 = 344.83-2*.67
k1 = 343.4900
>> k2 = 344.83+2*.67
k2 = 346.1700

The 95% confidence interval is (343.49, 346.17).
In practice σ is not known. Consider $\sigma^2$: we use

$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$

to estimate it, so that

$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$

Instead of $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ we use an estimate of it, namely

$se(\bar{X}) = \frac{s_x}{\sqrt{n}}$

(just replace σ with its estimate). The interval then becomes

$\bar{x} \pm 2\,se(\bar{X})$
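In MATLAB these quantities are one-liners. The sketch below assumes the observations sit in a vector x (a stand-in name, not a variable defined in the notes):

% Sketch: approximate 95% interval for mu using the estimated sigma.
% Assumes a vector x of i.i.d. observations is already in the workspace.
xbar = mean(x);
sx   = std(x);               % sample standard deviation (1/(n-1) version)
se   = sx/sqrt(length(x));   % se(Xbar) = s_x/sqrt(n)
ci   = [xbar - 2*se, xbar + 2*se]   % approximate 95% confidence interval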
Replacing σ by the estimate $s_x$ changes the distribution of the standardized quantity: $\frac{\bar{X}-\mu}{se(\bar{X})}$ is no longer exactly standard normal but follows a Student t distribution with ν degrees of freedom.

[Figure: two density curves over -4 to 4; one is the t with 30 d.f., the other is the standard normal, and they are hard to tell apart.]
All t’s are centered at 0 and look like the standard normal.
They have a bell shape.
For smaller ν, the t puts more probability in the “tails” than the standard
normal.
[Figure: histogram of draws from the t(3).]
It is like the standard normal in the middle but sometimes you can get
large values (negative or positive).
Important: For our normal mean problem we use ν=n-1.
The number of degrees of freedom is equal to the number
of observations minus one.
[Figure: the $t_{n-1}$ density, with .95 probability between $-t_{n-1,.025}$ and $t_{n-1,.025}$ and .025 probability in each tail beyond them.]

Interpret: $-t_{n-1,.025}$ and $t_{n-1,.025}$ are the numbers such that we have 95% probability of being between them.

$t_{n-1,.025} \approx 2$
For smaller n, the t value gets bigger than 2.
Here is a table of t values and n:

t_{n-1,.025}    n
4.303           3
2.228           11
2.086           21
2.042           31
2.00            61

We can see that for n>30 (or even about 20) the t value is about 2. Clearly, -t would be the same value with opposite sign due to the symmetry of the t distribution.
Note: Traditionally you get these t cut-off values
from a table in the back of a statistics tome. We
can also employ MatLab.
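For instance, if the Statistics Toolbox is available, tinv returns these cut-offs directly: $t_{n-1,.025}$ is the .975 quantile of the t distribution with n-1 degrees of freedom. A short sketch reproducing the table above:

% Sketch: t cut-offs for a 95% interval (requires the Statistics Toolbox).
for n = [3 11 21 31 61]
    tcut = tinv(0.975, n-1);   % upper .025 cut-off of the t with n-1 d.f.
    fprintf('n = %2d   t_{n-1,.025} = %.3f\n', n, tcut);
end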
$\Pr\!\left(-t_{n-1,.025} \le \frac{\bar{X}-\mu}{se(\bar{X})} \le t_{n-1,.025}\right) = .95$

$\bar{x} \pm t_{n-1,.025}\,se(\bar{X})$
An exact 95% confidence interval for µ with
σ unknown is
$\bar{x} \pm t_{n-1,.025}\,se(\bar{X})$
Of course, using the t cut-off instead of 2 will make the
interval bigger for smaller n.
This accounts for the fact that we are not sure that our estimate $s_x$ of σ is quite right.
Example:
Back to our weight data.
With n=500 the sample standard deviation is 15.455 and
the sample mean is 344.83.
The relevant t distribution for us is the one with ν=499 which
is just like the standard normal. So, the t-value is about 2.
Compute $se(\bar{X}) = \frac{15.455}{\sqrt{500}} = .69$

>> k1 = 344.83-1.4
k1 = 343.4300
>> k2 = 344.83+1.4
k2 = 346.2300

The confidence interval is 344.83 +/- 1.4, i.e. about (343.43, 346.23).
>> hedges = (290+(1:10:120))';
>> n = histc(weights,hedges);
>> bar(hedges,n,5,'m');
T Confidence Intervals
Variable N Mean StDev SE Mean 95.0 % CI
weights 500 344.828 15.455 0.691 ( 343.470, 346.186)
[Figure: histogram of weights, with the 95% t-confidence interval for the mean marked around $\bar{X}$.]
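Assuming the 500 observations are in a vector called weights (as in the histogram commands above), the whole interval can be computed in a few lines; tinv again supplies the cut-off. This is just a sketch of the calculation behind the output table above.

% Sketch: exact 95% t confidence interval for the mean weight.
% Assumes the vector 'weights' (500 observations) is in the workspace.
n    = length(weights);
xbar = mean(weights);
se   = std(weights)/sqrt(n);      % se(Xbar) = s_x/sqrt(n), about 0.691
tcut = tinv(0.975, n-1);          % about 1.96 for n = 500
ci   = [xbar - tcut*se, xbar + tcut*se]   % roughly (343.47, 346.19)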
What if we just use the first 10 observations?
Compute $se(\bar{X}) = \frac{14.6}{\sqrt{10}} = 4.6$

The confidence interval becomes:

348.5 +/- 2.262*4.6 = 348.5 +/- 10.4 = (338.1, 358.9)
Variable N Mean StDev SE Mean 95.0 % CI
weights1 10 348.51 14.60 4.62 (338.07,358.96)
[Figure: histogram of weights10 (the first 10 weights), with the 95% t-confidence interval for the mean marked around $\bar{X}$.]
Example:
Let us get a 95% confidence interval for the true mean of the Canadian returns, $\bar{x} \pm t_{n-1,.025}\,se(\bar{X})$.

>> mi=min(canada); ma=max(canada); dd=ma-mi;
>> hedges=(mi-0.05+(1:1:dd*100+10)/100)';
>> n=histc(canada,hedges);
>> bar(hedges,n,1);
>> title('canada');

Is the confidence interval big?
4.5 The Central Limit Theorem &
General Approximate Confidence Interval for µ
Whatever the (i.i.d.) distribution of the X's, we still have $E(\bar{X}) = \mu$ and $Var(\bar{X}) = \frac{\sigma^2}{n}$.

Nonetheless, if the X's are not i.i.d. normal we do not have

$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$

precisely.
However, by the central limit theorem we can write the following approximation:

$\frac{\bar{X}-\mu}{se(\bar{X})} \approx N(0,1) \;\Rightarrow\; \Pr\!\left(-2 < \frac{\bar{X}-\mu}{se(\bar{X})} < 2\right) \approx .95$

so the approximate 95% confidence interval is again

$\bar{x} \pm 2\,se(\bar{X})$
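To see why this approximation is trusted, one can repeat the coverage simulation with a clearly non-normal population, for instance exponential draws: the interval $\bar{x} \pm 2\,se(\bar{X})$ still contains the true mean close to 95% of the time once n is moderately large. The distribution and sample size below are illustrative choices.

% Sketch: CLT-based interval applied to non-normal (exponential) data.
% An exponential with rate 1 has true mean mu = 1.
mu = 1; n = 100; nrep = 100000;

covered = 0;
for r = 1:nrep
    x    = -log(rand(n,1));        % exponential(1) draws
    xbar = mean(x);
    se   = std(x)/sqrt(n);
    covered = covered + ((xbar-2*se <= mu) && (mu <= xbar+2*se));
end
fprintf('coverage with non-normal data: %.3f (close to .95)\n', covered/nrep);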
Example:
A random sample of customers of a company providing a service; the variable is the minutes the service is used over a given time period.

[Figure: histogram of minutes of use.]

The data does not look normal but, as pointed out earlier, the CI for the mean is still approximately valid.
Another example: daily volume in the cattle pit.

[Figure: histogram of daily volume.]

Is the data i.i.d. from some distribution? If so, the same approach gives an approximate confidence interval for the true daily mean value.

Note: if the data is not i.i.d., similar methods apply.
Special case: 0/1 (Bernoulli) data. The data must still be i.i.d. In this case the estimate is the sample proportion $\hat{p}$, and the approximate confidence interval is

$\hat{p} \pm 2\,se(\hat{p}) = \hat{p} \pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
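As a sketch of the 0/1 case: suppose y is a vector of zeros and ones (a stand-in name, not one of the data sets in the notes). Then the approximate interval for the true proportion p is computed exactly as above, with p̂ playing the role of x̄.

% Sketch: approximate 95% confidence interval for a proportion p.
% Assumes y is a vector of 0/1 (Bernoulli) observations.
n    = length(y);
phat = mean(y);                      % sample proportion
sep  = sqrt(phat*(1-phat)/n);        % se(phat)
ci   = [phat - 2*sep, phat + 2*sep]  % approximate 95% CI for p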
Well, if you knew µ and σ you could make statements like: with 95% probability the
next observation (the return tomorrow) will be between µ-2σ and µ+2σ (we know
that this is true due to the properties of the normal, right?).
This is a powerful (and useful) statement for prediction (suppose you are deciding
whether you want to invest in a certain asset based on its future performance).
But what if you do not know µ and σ (and this is typically the case)? We simply
estimate them based on the available data. In the end we come up with statements
like the one above for intervals that are constructed using estimated quantities.
We use the past (the data) to understand (or predict) the future (returns tomorrow,
for example).
In general: suppose the data are i.i.d. normal.
The fundamental question is: How do you predict
the next one?
If you knew µ and σ, your 95% predictive interval would be (µ−2σ, µ+2σ) (by the properties of the Normal).

[Figure: normal density f(x) centred at µ, with the interval (µ−2σ, µ+2σ) marked.]
The problem is that we do not know what the true
parameters (i.e., µ and σ) are. We have to infer them from
the data. We estimate them.
$(\bar{x} - 2s_x,\; \bar{x} + 2s_x)$
Now we have the main intuition of what we do.
Below, the predictive interval $(\bar{x} - 2s_x,\; \bar{x} + 2s_x)$ (pi) and the confidence interval (ci) are shown for the weight data.

[Figure: the first 10 weights (values roughly 336–346) plotted against observation number, with the 95% predictive interval (pi) and confidence interval (ci) marked.]

[Figure: the same intervals computed from all 500 weights.]

Note:
The confidence interval is an interval for the average weight µ.
The predictive interval is an interval for the next individual weight. Such a weight can be below or above µ.
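The two kinds of interval in the plots above can be computed side by side. The sketch below again assumes the vector weights is in the workspace and uses the rough cut-off 2; the point is the contrast: the confidence interval narrows as n grows, while the predictive interval stays roughly four sample standard deviations wide.

% Sketch: 95% confidence interval for mu vs 95% predictive interval
% for the next observation. Assumes the vector 'weights' is available.
xbar = mean(weights);
sx   = std(weights);
n    = length(weights);

ci   = [xbar - 2*sx/sqrt(n), xbar + 2*sx/sqrt(n)]  % interval for the mean mu
pred = [xbar - 2*sx,         xbar + 2*sx]          % interval for the next weight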
1) We take the data (the past) and analyze it using descriptive tools (Chapter 1 in the notes, but we’ll see more sophisticated ways to analyze the data in the next chapters)
2) We then choose a probability model for the data (for example, i.i.d. normal, or i.i.d. Bernoulli for 0/1 data)
3) Once we have a model for the data, we simply have to estimate the
parameters of the model (examples: µ and σ in the normal case, p in the
Bernoulli case) to do prediction (understand the future). Again, prediction
is the bottom line of what we do. (Notice: without a model for the data we
cannot do prediction. For example: our interval µ-2σ, µ+2σ is based on
the normal model.)