Unit 2
2.0 Properties of Estimators
Properties of estimators are all consequences of their sampling distributions. Most of the time, the full sampling distribution of θ̂ is not available; therefore, we focus on properties that do not require complete knowledge of the sampling distribution.
2.1 Unbiasedness
The first, and probably the simplest, property is called unbiasedness. In words, an estimator θ̂ is unbiased if, when applied to many different samples from F, θ̂ equals the true θ on average. Formally, an estimator T of θ is said to be an unbiased estimator of θ if and only if (iff) E(T) = θ for all values of θ in Θ, where Θ is the parameter space.
That is, no matter the actual value of θ, if we apply θ̂ = θ̂(X₁, X₂, ..., Xₙ) to many datasets X₁, X₂, ..., Xₙ sampled from F, then the average of these θ̂ values will equal θ — in other words, E(θ̂) = θ for all θ. This is clearly not an unreasonable property, and a lot of work in mathematical statistics has focused on unbiased estimation.
Example 2.1.1. Let X₁, X₂, ..., Xₙ be an iid sample of size n from some distribution F having mean μ and variance σ². This distribution could be normal, but it need not be. Consider μ̂ = X̄, the sample mean, and σ̂² = S², the sample variance. Then μ̂ and σ̂² are unbiased estimators of μ and σ², respectively.
Proof. Let X₁, ..., Xₙ be independent random observations drawn from a population with E(Xᵢ) = μ and Var(Xᵢ) = σ². Unbiasedness of X̄ is immediate: E(X̄) = (1/n) Σ E(Xᵢ) = μ. We now show that the sample variance S² = Σ(Xᵢ − X̄)²/(n − 1) is an unbiased estimator of the population variance, i.e. E(S²) = σ².
Recall the basic rules E(cX) = cE(X) and Var(X) = E(X²) − [E(X)]², so that
E(Xᵢ²) = Var(Xᵢ) + [E(Xᵢ)]² = σ² + μ²,
and, since Var(X̄) = σ²/n,
E(X̄²) = Var(X̄) + [E(X̄)]² = σ²/n + μ².
Expanding the sum of squared deviations,
Σ(Xᵢ − X̄)² = Σ Xᵢ² − n X̄².
Taking expectations,
E[Σ(Xᵢ − X̄)²] = Σ E(Xᵢ²) − n E(X̄²) = n(σ² + μ²) − n(σ²/n + μ²) = nσ² + nμ² − σ² − nμ² = (n − 1)σ².
So we can write
E(S²) = E[Σ(Xᵢ − X̄)²/(n − 1)] = (n − 1)σ²/(n − 1) = σ².
Therefore, the sample variance is an unbiased estimator of the population variance, regardless
of the model.
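As a quick illustration of this claim, here is a short Python sketch (not from the original notes; the exponential population and all numeric values are my own illustrative choices) that averages X̄ and S² over many simulated samples.

```python
import numpy as np

# Illustrative Monte Carlo sketch: check that the sample mean and the sample
# variance (with the n-1 divisor) are unbiased, here for an exponential
# population, which is deliberately non-normal.
rng = np.random.default_rng(0)
mu, sigma2 = 2.0, 4.0          # exponential with mean 2 has variance 4
n, reps = 10, 200_000

samples = rng.exponential(scale=mu, size=(reps, n))
xbar = samples.mean(axis=1)                 # sample means
s2 = samples.var(axis=1, ddof=1)            # sample variances with n-1 divisor

print("average of X̄ :", xbar.mean(), "(target μ  =", mu, ")")
print("average of S²:", s2.mean(), "(target σ² =", sigma2, ")")
```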
While unbiasedness is a nice property for an estimator to have, it doesn't carry too much weight. Specifically, an estimator can be unbiased but otherwise very poor. For an extreme example, suppose that P(θ̂ = θ − 10⁵) = P(θ̂ = θ + 10⁵) = 1/2. In this case, E(θ̂) = θ, but θ̂ is always very far away from θ. There is also a well-known phenomenon (the bias–variance trade-off) which says that often allowing the bias to be non-zero will improve estimation accuracy; more on this below. The following example highlights some of the problems of focusing primarily on the unbiasedness property.
Example 2.1.2. Let X be a single observation from a Pois(θ) distribution. Suppose the goal is to estimate e^(−2θ), not θ itself. We know that θ̂ = X is an unbiased estimator of θ. However, the natural estimator e^(−2X) is not an unbiased estimator of e^(−2θ). Consider instead η̂ = (−1)^X. This estimator is unbiased:
E[(−1)^X] = Σ_{x=0}^∞ (−1)^x e^(−θ) θ^x / x! = e^(−θ) Σ_{x=0}^∞ (−θ)^x / x! = e^(−θ)·e^(−θ) = e^(−2θ).
In fact, it can even be shown that (−1)^X is the "best" of all unbiased estimators; cf. the Lehmann–Scheffé theorem. But even though it's unbiased, it can only take the values ±1, so, depending on θ, (−1)^X may never be close to e^(−2θ).
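A small simulation can make this concrete. The sketch below (the value of θ is an illustrative choice of mine) checks numerically that E[(−1)^X] matches e^(−2θ), while the estimator itself only ever takes the values ±1.

```python
import numpy as np

# Illustrative sketch: verify by simulation that E[(-1)^X] = exp(-2*theta)
# when X ~ Poisson(theta), and note how crude the estimator is despite being
# unbiased (it only ever takes the values +1 and -1).
rng = np.random.default_rng(1)
theta, reps = 0.7, 500_000

x = rng.poisson(theta, size=reps)
estimates = (-1.0) ** x

print("mean of (-1)^X :", estimates.mean())
print("target e^(-2θ) :", np.exp(-2 * theta))
print("possible values of the estimator:", np.unique(estimates))
```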
Example 2.1.3. If X₁, ..., Xₙ are Bernoulli random variables with parameter p, then p̂ = (1/n) Σᵢ₌₁ⁿ Xᵢ = X̄ is the maximum likelihood estimator (MLE) of p. Is the MLE an unbiased estimator of p?
We compute:
E(p̂) = E((1/n) Σ Xᵢ) = (1/n) Σ E(Xᵢ) = (1/n) Σ p = (1/n)(np) = p.
The first equality holds because we've merely replaced p̂ with its definition. The second equality holds by the rules of expectation for a linear combination. The third equality holds because E(Xᵢ) = p. The fourth equality holds because when you add the value p up n times, you get np. And, of course, the last equality is simple algebra. So the MLE p̂ is indeed unbiased.
Example 2.1.4. Let X₁, X₂, ..., Xₙ be iid Ber(θ) and suppose we want to estimate η = θ/(1 − θ), the so-called odds ratio. Suppose η̂ is an unbiased estimator of η, so that E(η̂) = θ/(1 − θ) for all θ or, equivalently, θ − (1 − θ)E(η̂) = 0 for all θ.
Here the joint PMF of (X₁, ..., Xₙ) is f_θ(x₁, ..., xₙ) = θ^(x₁+x₂+...+xₙ)(1 − θ)^(n−(x₁+...+xₙ)). Writing out the expectation E(η̂) as a sum against this PMF, the left-hand side of the equation above is a polynomial in θ of degree at most n + 1. From the Fundamental Theorem of Algebra, there can be at most n + 1 real roots of the above equation. However, unbiasedness requires that there be infinitely many roots (one for every θ in (0, 1)). This contradicts the fundamental theorem, so we must conclude that there are no unbiased estimators of η.
Another useful, though perhaps crude, measure of closeness of an estimator is the mean-square error (MSE) of the estimator.
Remark. MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]², where Bias(θ̂) = E(θ̂) − θ is the bias of θ̂. Thus if an estimator is unbiased then its mean-squared error is equal to its variance.
Example 2.1.5:
………(*)
…………(**), since W is unbiased.
Substituting the expressions for Var(T) in (*) and (**), after simplification we get that MSE(W) ≤ MSE(T); the inequality is strict for any n greater than one, and the two are equal for n = 1. Therefore W is better than T as an estimator of θ in the MSE sense.
2.2 Consistency
Another reasonable property is that the estimator θ̂ = θ̂ₙ, which depends on the sample size n through the dependence on X₁, X₂, ..., Xₙ, should get close to the true θ as n gets larger:
P(|θ̂ₙ − θ| > ε) → 0 as n → ∞, for every ε > 0.
The usual setup is that we are interested in the value of some parameter θ that describes a feature of a population. We draw a random sample X₁, X₂, ..., Xₙ from the population and compute the estimate θ̂ from it. The idea is that we'd like θ̂ to get "closer" and closer to θ as we draw larger and larger samples:
lim_{n→∞} P(|θ̂ₙ − θ| > ε) = 0.
For example, the sample mean is consistent for μ, since Var(X̄) = σ²/n and, by Chebyshev's inequality,
P(|X̄ − μ| > ε) ≤ Var(X̄)/ε² = σ²/(nε²) → 0 as n → ∞.
Definition 2.2.1. Let T and {Tₙ : n ≥ 1} be random variables on a common sample space. Then Tₙ converges to T in probability if, for any ε > 0, lim_{n→∞} P(|Tₙ − T| > ε) = 0.
The Law of Large Numbers (LLN). If X₁, X₂, ..., Xₙ are iid with mean μ and finite variance σ², then X̄ₙ = n⁻¹ Σᵢ₌₁ⁿ Xᵢ converges in probability to μ. The LLN is a powerful result and will be used throughout the course. Two useful tools for proving convergence in probability are the inequalities of Markov and Chebyshev.
Markov's inequality. Let X be a nonnegative random variable, i.e., P(X ≥ 0) = 1. Then, for any ε > 0, P(X > ε) ≤ E(X)/ε.
Chebyshev's inequality. Let X be a random variable with mean μ and variance σ². Then, for any ε > 0, P(|X − μ| > ε) ≤ σ²/ε².
It is through convergence in probability that we can say that an estimator θ̂ = θ̂ₙ gets close to the estimand θ as n gets large.
Definition 2.2.2. An estimator θ̂ₙ of θ is consistent if θ̂ₙ → θ in probability. A rough way to understand consistency of an estimator θ̂ₙ of θ is that the sampling distribution of θ̂ₙ gets more and more concentrated around θ as n → ∞. The following example demonstrates both a theoretical verification of consistency and a visual confirmation via Monte Carlo.
Example 2.2.1: Suppose Y ~ Bin(n, θ), where θ is the probability of success. Show that θ̂ = Y/n is a consistent estimator of the population parameter θ.
E(θ̂) = E(Y/n) = (1/n)E(Y) = (1/n)nθ = θ. Therefore, θ̂ is an unbiased estimator of θ.
Var(θ̂) = Var(Y/n) = (1/n²)Var(Y) = (1/n²)·nθ(1 − θ) = θ(1 − θ)/n.
By Chebyshev's inequality, P(|θ̂ − θ| > ε) ≤ Var(θ̂)/ε² = θ(1 − θ)/(nε²) → 0 as n → ∞; equivalently, SE(θ̂) = √Var(θ̂) = √(θ(1 − θ)/n) tends to zero as n → ∞. Hence θ̂ is consistent.
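The Monte Carlo confirmation mentioned above can be sketched as follows (the value of θ, the sample sizes, and the tolerance 0.05 are arbitrary illustrative choices): for increasing n, the sampling distribution of θ̂ = Y/n concentrates around θ.

```python
import numpy as np

# Monte Carlo sketch of consistency: as n grows, P(|θ̂ - θ| > 0.05) shrinks
# and so does the standard error of θ̂ = Y/n.
rng = np.random.default_rng(2)
theta, reps = 0.3, 100_000

for n in (10, 100, 1000, 10000):
    theta_hat = rng.binomial(n, theta, size=reps) / n
    frac_far = np.mean(np.abs(theta_hat - theta) > 0.05)
    print(f"n={n:6d}  P(|θ̂-θ|>0.05) ≈ {frac_far:.4f}  SE ≈ {theta_hat.std():.4f}")
```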
Example 2.2.2. Recall the setup of Example 2.1.1. It follows immediately from the LLN that μ̂ₙ = X̄ is a consistent estimator of the mean μ. Moreover, the sample variance σ̂ₙ² = S² is a consistent estimator of σ². Indeed, writing S² = {n/(n − 1)}{n⁻¹ Σ Xᵢ² − X̄²}, the first term in the braces converges in probability to σ² + μ² by the LLN applied to the Xᵢ²'s; the second term in the braces converges in probability to μ² by the LLN (see, also, the Continuous Mapping Theorem below).
Continuous Mapping Theorem. If θ̂ₙ is a consistent estimator of θ and the function g is continuous at θ, then g(θ̂ₙ) is a consistent estimator of g(θ). To see this, fix ε > 0 and, by continuity, choose δ > 0 such that |g(θ̂ₙ) − g(θ)| > ε implies |θ̂ₙ − θ| > δ. Then the probability of the event on the left is no more than the probability of the event on the right, and this latter probability vanishes as n → ∞ by assumption. Therefore
lim_{n→∞} P(|g(θ̂ₙ) − g(θ)| > ε) = 0.
Example 2.2.3. Let X₁, X₂, ..., Xₙ be iid Pois(θ). Since the mean and the variance both equal θ for the Poisson distribution, it follows that both θ̂ₙ = X̄ and θ̃ₙ = S² are unbiased and consistent for θ. Another comparison of these two estimators is by considering a new estimator θ̇ₙ = (X̄ + S²)/2. Define the function g(x₁, x₂) = ½(x₁ + x₂). Clearly g is continuous (why?). Since the pair (θ̂ₙ, θ̃ₙ) is a consistent estimator of (θ, θ), it follows from the continuous mapping theorem that θ̇ₙ = g(θ̂ₙ, θ̃ₙ) is a consistent estimator of g(θ, θ) = θ. Like with unbiasedness, consistency is a nice property for an estimator to have. But consistency alone is not enough to make an estimator a good one.
Example 2.2.4. Let X₁, X₂, ..., Xₙ be iid N(θ, 1). Consider the estimator
θ̂ₙ = 10⁷ if n < 10⁷⁵⁰, and θ̂ₙ = X̄ otherwise.
Show that θ̂ₙ is a consistent estimator of θ.
Solution
Let N = 10⁷⁵⁰. Although N is very large, it's ultimately finite and can have no effect on the limit. To see this, fix ε > 0 and define
aₙ = P(|θ̂ₙ − θ| > ε) and bₙ = P(|X̄ₙ − θ| > ε).
Since bₙ → 0 by the LLN, and aₙ = bₙ for all n ≥ N, it follows that aₙ → 0 and, hence, θ̂ₙ is consistent. However, for any reasonable sample size n (anything less than N), θ̂ₙ equals 10⁷ no matter what data are observed, so consistency alone says nothing about its performance in practice.
We therefore turn to comparing estimators by their variances. Among unbiased estimators, the one with smaller variance will tend to be more concentrated about the parameter and is therefore preferable. In some cases one unbiased estimator may have smaller variance than another only for some values of θ. In certain cases it is possible to show that a particular unbiased estimator has the smallest possible variance among all possible unbiased estimators for all values of θ. Such an estimator T* is called a uniformly minimum variance unbiased estimator (UMVUE) of θ; that is,
(i) T* is unbiased, and
(ii) Var(T*) ≤ Var(T) for any other unbiased estimator T of θ, for all θ ∈ Θ, where Θ is the parameter space.
Two natural questions are whether there is a lower bound on the variance of an unbiased estimator, and how to find an estimator that attains it. The next theorem gives a partial answer to the first question: it gives a lower bound for the variance of an unbiased estimator for certain probability functions. If T is an unbiased estimator of θ based on a random sample of size n from a distribution whose density (or mass) function f(x; θ) satisfies the regularity conditions stated below, then
Var(T) ≥ 1 / { n · E[(∂ ln f(X; θ)/∂θ)²] } = CRLB.
There is no general answer to the second question given here.
Example 2.2.5
a) Consider a random sample from an exponential distribution with mean θ. Find the CRLB for θ. The sample mean X̄ is unbiased for θ and its variance, θ²/n, is equal to the CRLB; therefore X̄ is the UMVUE for θ.
b) Consider a random sample from a Bernoulli distribution; find the CRLB for p. The CRLB, p(1 − p)/n, equals the variance of the sample proportion; therefore the sample proportion is the UMVUE for p.
c) Consider a random sample from a Normal distribution with mean μ and variance σ². Find the CRLB for μ and for σ². The MLEs for μ and σ² are X̄ and (1/n)Σ(Xᵢ − X̄)², respectively. Note that X̄ is unbiased for μ, its variance σ²/n attains the CRLB, and so X̄ is the UMVUE for μ.
d) Consider sampling from the uniform distribution on the interval (0, θ). Find the CRLB for θ. (Here the support depends on θ, so the regularity conditions behind the CRLB fail.)
Remark. In some textbooks the UMVUE is simply called the UMVE (unbiased minimum variance estimator) or the MVUE (minimum variance unbiased estimator), without the "uniformly".
Clearly the deviation of θ̂ from the true value of θ, |θ̂ − θ|, measures the quality of the estimator; equivalently, we can use (θ̂ − θ)² for ease of computation. Since θ̂ is a random variable, we should take an average to evaluate the quality of the estimator. Thus, we introduce the following.
Definition 2.2.5: The mean square error (MSE) of an estimator θ̂ of a parameter θ is the function of θ defined by E(θ̂ − θ)², and this is denoted MSE(θ̂).
This is also called the risk function of the estimator, with (θ̂ − θ)² called the quadratic loss function. The expectation is with respect to the random variables X₁, ..., Xₙ since they are the only random components in the expression.
Notice that the MSE measures the average squared difference between the estimator θ̂ and the parameter θ, a somewhat reasonable measure of performance for an estimator. In general, any increasing function of the absolute distance |θ̂ − θ| would serve to measure the goodness of an estimator (mean absolute error, E|θ̂ − θ|, is a reasonable alternative). But MSE has at least two advantages over other distance measures: first, it is analytically tractable and, secondly, it has the interpretation
MSE(θ̂) = E(θ̂ − θ)² = Var(θ̂) + [E(θ̂) − θ]² = Var(θ̂) + (Bias of θ̂)².
This is so because
E(θ̂ − θ)² = E(θ̂²) + θ² − 2θE(θ̂)
= Var(θ̂) + [E(θ̂)]² + θ² − 2θE(θ̂)
= Var(θ̂) + [E(θ̂) − θ]².
Definition 2.2.6: The bias of an estimator θ̂ of a parameter θ is the difference between the expected value of θ̂ and θ; that is, Bias(θ̂) = E(θ̂) − θ. An estimator whose bias is identically equal to 0 is called an unbiased estimator and satisfies E(θ̂) = θ for all θ.
Thus, the MSE has two components: one measures the variability of the estimator (precision) and the other measures the bias (accuracy). An estimator that has good MSE properties has small combined variance and bias. To find an estimator with good MSE properties, we need to find estimators that control both variance and bias.
For an unbiased estimator θ̂, we have MSE(θ̂) = E(θ̂ − θ)² = Var(θ̂); so for an unbiased estimator the MSE equals the variance.
Example: Let X₁, ..., Xₙ be iid from the double exponential distribution with density
f(x) = (1/(2θ)) exp(−|x|/θ), −∞ < x < ∞, θ > 0.
Show that the maximum likelihood estimator of θ,
θ̂ = Σᵢ₌₁ⁿ |Xᵢ| / n,
is unbiased, and find its MSE.
Solution: Let us first calculate E|X| and E(X²). Substituting y = x/θ,
E|X| = ∫ |x| f(x) dx = 2 ∫₀^∞ x (1/(2θ)) exp(−x/θ) dx = θ ∫₀^∞ y e^(−y) dy = θ Γ(2) = θ,
E(X²) = ∫ x² f(x) dx = 2 ∫₀^∞ x² (1/(2θ)) exp(−x/θ) dx = θ² ∫₀^∞ y² e^(−y) dy = θ² Γ(3) = 2θ².
Therefore, E(θ̂) = E[(|X₁| + ... + |Xₙ|)/n] = [E|X₁| + ... + E|Xₙ|]/n = nθ/n = θ. So θ̂ is an unbiased estimator of θ.
Thus the MSE of θ̂ is equal to its variance, i.e.
MSE(θ̂) = E(θ̂ − θ)² = Var(θ̂) = Var[(|X₁| + ... + |Xₙ|)/n]
= [Var|X₁| + ... + Var|Xₙ|]/n² = Var|X|/n
= [E(X²) − (E|X|)²]/n = (2θ² − θ²)/n = θ²/n.
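Assuming the double-exponential model written above, a simulation sketch (with illustrative parameter values of my own) can be used to check both the unbiasedness of θ̂ and the value θ²/n for its MSE.

```python
import numpy as np

# Simulation sketch: for the double exponential model above, check that
# theta_hat = mean(|X_i|) is unbiased and that its MSE is close to theta^2/n.
rng = np.random.default_rng(3)
theta, n, reps = 1.5, 20, 200_000

x = rng.laplace(loc=0.0, scale=theta, size=(reps, n))
theta_hat = np.abs(x).mean(axis=1)

print("E(θ̂) ≈", theta_hat.mean(), "  (target θ =", theta, ")")
print("MSE  ≈", np.mean((theta_hat - theta) ** 2), "  (target θ²/n =", theta**2 / n, ")")
```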
The statistic S²: Recall that if X₁, ..., Xₙ come from a normal distribution with variance σ², and
S² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1),
then it can be shown that (n − 1)S²/σ² ~ χ²ₙ₋₁. From the properties of the χ² distribution, we have
E[(n − 1)S²/σ²] = n − 1, and hence E(S²) = σ².
And
Var[(n − 1)S²/σ²] = 2(n − 1), and hence Var(S²) = 2σ⁴/(n − 1).
Recall that MSE(θ̂) = E(θ̂ − θ)². This measures the average (squared) distance between θ̂(X₁, X₂, ..., Xₙ) and θ as the data X₁, X₂, ..., Xₙ vary according to F. So if θ̂ and θ̃ are two estimators of θ, we say that θ̂ is better than θ̃ (in the mean-square error sense) if MSE(θ̂) ≤ MSE(θ̃).
Next are some properties of the MSE. The first relates the MSE to the variance and bias of an estimator.
Proposition 2.2.1. MSE(θ̂) = Var(θ̂) + [Bias(θ̂)]². Consequently, if θ̂ is an unbiased estimator, then MSE(θ̂) = Var(θ̂).
Proof. Let m = E(θ̂). Then
MSE(θ̂) = E(θ̂ − θ)² = E[(θ̂ − m) + (m − θ)]² = E(θ̂ − m)² + 2(m − θ)E(θ̂ − m) + (m − θ)².
The first term is the variance of θ̂; the second term is zero by definition of m; and the third term is the squared bias.
Often the goal is to find estimators with small MSEs. From Proposition 2.2.1, this can be achieved by picking θ̂ to have small variance and small squared bias. But it turns out that, in general, making the bias small increases the variance, and vice versa. This is what is called the bias–variance trade-off. In some cases, if minimizing MSE is the goal, it can be better to allow a little bit of bias if it means a drastic decrease in the variance. In fact, many common estimators are biased, at least partly because of this trade-off.
Example 2.2.6. Let X₁, X₂, ..., Xₙ be iid N(μ, σ²) and suppose the goal is to estimate σ².
Define the statistic T = Σᵢ₌₁ⁿ (Xᵢ − X̄)². Consider a class of estimators σ̂² = aT where a is a positive number. Reasonable choices of a include a = (n − 1)⁻¹ and a = n⁻¹. Let's find the value of a that minimizes the MSE.
First observe that T/σ² is a chi-square random variable with n − 1 degrees of freedom, so that E(T) = (n − 1)σ² and Var(T) = 2(n − 1)σ⁴. Write R(a) for MSE(aT). Then
R(a) = Var(aT) + [E(aT) − σ²]² = σ⁴{2a²(n − 1) + [a(n − 1) − 1]²}.
To minimize R(a), set the derivative equal to zero, and solve for a. That is,
0 = R′(a) = σ⁴{4a(n − 1) + 2(n − 1)[a(n − 1) − 1]} = 2(n − 1)σ⁴{a(n + 1) − 1}.
From here it's easy to see that a = (n + 1)⁻¹ is the only solution (and this must be a minimum since R(a) is a quadratic with positive leading coefficient). Therefore, among estimators of the form σ̂² = a Σᵢ₌₁ⁿ (Xᵢ − X̄)², the one with smallest MSE is σ̂² = (n + 1)⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄)². Note that this estimator is not unbiased since a ≠ (n − 1)⁻¹. To put this another way, the classical estimator S² pays a price (larger MSE) for being unbiased.
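The calculation above can be checked numerically; the short sketch below (with n = 10 and σ⁴ = 1 as illustrative choices of mine) evaluates R(a) for the three natural divisors.

```python
import numpy as np

# Sketch: compare the exact MSE formula
# R(a) = sigma^4 * (2*a^2*(n-1) + (a*(n-1) - 1)^2)
# for divisors n-1 (unbiased S^2), n (MLE), and n+1 (MSE-optimal).
sigma4, n = 1.0, 10

def R(a):
    return sigma4 * (2 * a**2 * (n - 1) + (a * (n - 1) - 1) ** 2)

for label, a in [("1/(n-1)", 1 / (n - 1)), ("1/n", 1 / n), ("1/(n+1)", 1 / (n + 1))]:
    print(f"a = {label:8s}  MSE = {R(a):.5f}")
```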
Proposition 2.2.3. If MSE(θ̂ₙ) → 0 as n → ∞, then θ̂ₙ is a consistent estimator of θ.
Proof. Fix ε > 0 and note that P(|θ̂ₙ − θ| > ε) = P((θ̂ₙ − θ)² > ε²). Applying Markov's inequality to the latter term gives an upper bound of ε⁻² MSE(θ̂ₙ). Since this goes to zero as n → ∞ by assumption, θ̂ₙ is consistent.
As we have already remarked, for any given parameter estimation problem, there are many different possible choices of estimator. One desirable quality for an estimator is that it be unbiased. However, this requirement alone does not impose a substantial condition, since (as we have seen) there can exist several different unbiased estimators for a given parameter. In the last two examples, we constructed two different unbiased estimators for the parameter θ, given a random sample X₁, X₂, ..., Xₙ from the uniform distribution on [0, θ]: namely,
θ̂₁ = ((n + 1)/n)·max(x₁, x₂, ..., xₙ) and θ̂₂ = (2/n)(x₁ + x₂ + ... + xₙ).
2.3 Efficiency
Likewise, it is also not hard to see that, given a random sample X₁, X₂ from the normal distribution with mean θ and standard deviation σ, the estimators θ̂₁ = ½(x₁ + x₂) and θ̂₂ = ⅓(x₁ + 2x₂) are also both unbiased. More generally, any estimator of the form ax₁ + (1 − a)x₂ will be an unbiased estimator of the mean. We would now like to know if there is a meaningful way to say one of these unbiased estimators is better than the other.
In the abstract, it seems reasonable to say that an estimator with a smaller variance is better
than one with a larger variance, since a smaller variance would indicate that the value of the
estimator stays closer to the “true” parameter value more often. We formalize this as follows:
Definition 2.3.1: If θ̂₁ and θ̂₂ are two unbiased estimators for the parameter θ, we say that θ̂₁ is more efficient than θ̂₂ if Var(θ̂₁) < Var(θ̂₂). For example, if Var(x̄ₙ) = σ²/n and Var(x̄ₘ) = σ²/m with n > m, then Var(x̄ₙ) < Var(x̄ₘ), and we interpret x̄ₙ, the mean of the larger sample, as the more efficient estimator. Equivalently, θ̂₁ is more efficient than θ̂₂ when the ratio Var(θ̂₁)/Var(θ̂₂) is less than 1.
Example 2.3.1. Let x₁, x₂, x₃, x₄ be 4 observations taken from an N(μ, σ²) population. Find the efficiency of T = (1/7)(x₁ + 3x₂ + 2x₃ + x₄) relative to x̄ = (1/4)Σᵢ₌₁⁴ xᵢ.
First, T is unbiased:
E(T) = (1/7)[μ + 3μ + 2μ + μ] = (1/7)·7μ = μ.
Its variance is
Var(T) = (1/49)[Var(x₁) + 9·Var(x₂) + 4·Var(x₃) + Var(x₄)] = (1/49)[σ² + 9σ² + 4σ² + σ²] = (15/49)σ².
Also Var(x̄) = σ²/4. This implies that
Var(x̄)/Var(T) = (σ²/4)/(15σ²/49) = 49/60 < 1,
so x̄ is more efficient than T.
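A simulation sketch (my own illustration; μ and σ are arbitrary) confirms the two variances and the relative efficiency 49/60.

```python
import numpy as np

# Check by simulation: Var(T) should be (15/49)*sigma^2 and Var(x̄) = sigma^2/4.
rng = np.random.default_rng(4)
mu, sigma, reps = 5.0, 2.0, 400_000

x = rng.normal(mu, sigma, size=(reps, 4))
T = (x[:, 0] + 3 * x[:, 1] + 2 * x[:, 2] + x[:, 3]) / 7
xbar = x.mean(axis=1)

print("Var(T)  ≈", T.var(), "  theory:", 15 / 49 * sigma**2)
print("Var(x̄) ≈", xbar.var(), "  theory:", sigma**2 / 4)
print("Var(x̄)/Var(T) ≈", xbar.var() / T.var(), "  theory:", 49 / 60)
```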
Example 2.3.2: Suppose that a random sample x, y is taken from the normal distribution with mean θ and standard deviation σ.
1. Find the variance of the estimator θ̂₁ = ½(x + y).
2. Find the variance of the estimator θ̂₂ = ⅓(x + 2y).
3. Show that θ̂₁ and θ̂₂ are both unbiased estimators of θ.
4. Which of θ̂₁, θ̂₂ is a more efficient estimator of θ?
5. More generally, for θ̂ₐ = ax + (1 − a)y, which value of a produces the most efficient estimator?
Solution
To compute the variances and check for unbiasedness, we will use properties of expected value and the additivity of variance for independent variables.
1. Find the variance of the estimator θ̂₁ = ½(x + y).
Because x and y are independent, their variances are additive, and Var(x) = Var(y) = σ². Then we have
Var(θ̂₁) = Var(½x + ½y) = Var(½x) + Var(½y) = ¼Var(x) + ¼Var(y) = ¼σ² + ¼σ² = ½σ².
2. Similarly,
Var(θ̂₂) = Var(⅓x + ⅔y) = (1/9)Var(x) + (4/9)Var(y) = (1/9)σ² + (4/9)σ² = (5/9)σ².
Thus Var(θ̂₂) = (5/9)σ².
3. Show that θ̂₁ and θ̂₂ are both unbiased estimators of θ.
We have E(θ̂₁) = E(½(x + y)) = ½E(x) + ½E(y) = ½θ + ½θ = θ.
Likewise, E(θ̂₂) = E(⅓(x + 2y)) = ⅓E(x) + ⅔E(y) = ⅓θ + ⅔θ = θ.
Thus, both estimators are unbiased.
4. Which of θ̂₁, θ̂₂ is a more efficient estimator of θ?
Since Var(θ̂₁) = ½σ² while Var(θ̂₂) = (5/9)σ², and ½ < 5/9, we see that θ̂₁ is more efficient.
Example 2.3.3: Suppose that a random sample x, y is taken from the normal distribution with mean θ and standard deviation σ. More generally, for θ̂ₐ = ax + (1 − a)y, which value of a produces the most efficient estimator?
In the same way as before, we can compute
Var(θ̂ₐ) = Var(ax + (1 − a)y) = a²Var(x) + (1 − a)²Var(y) = [a² + (1 − a)²]σ² = (2a² − 2a + 1)σ².
By calculus (the derivative is 4a − 2, which is zero for a = ½) or by completing the square (2a² − 2a + 1 = 2(a − ½)² + ½), we can see that the minimum of the quadratic occurs when a = ½. Thus, in fact, θ̂₁ is the most efficient estimator of this form. (Answer: a = ½.)
Intuitively, this last calculation should make sense, because if we put more weight on one observation, its variation will tend to dominate the calculation. In the extreme situation of taking θ̂₃ = y (which corresponds to a = 0), for example, we see that the variance is simply σ², which is much larger than the variance arising from the average. This is quite reasonable, since the average ½(x₁ + x₂) uses a bigger sample and thus captures more information than just using a single observation.
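The quadratic 2a² − 2a + 1 can also be scanned numerically; the following sketch (my own illustration) confirms the minimum at a = ½.

```python
import numpy as np

# The variance of a*x + (1-a)*y, scaled by sigma^2, is 2a^2 - 2a + 1;
# scan a on a grid to confirm the minimum is at a = 1/2.
a = np.linspace(0, 1, 101)
scaled_var = 2 * a**2 - 2 * a + 1          # Var(θ̂_a) / σ²
best = a[np.argmin(scaled_var)]
print("minimizing a ≈", best, " with Var/σ² =", scaled_var.min())
```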
Example 2.3.4: A random sample X₁, X₂, ..., Xₙ is taken from the uniform distribution on [0, θ].
1. Find the variance of the unbiased estimator θ̂₁ = ((n + 1)/n)·max(x₁, ..., xₙ).
2. Find the variance of the unbiased estimator θ̂₂ = (2/n)(x₁ + x₂ + ... + xₙ).
3. Which estimator is a more efficient estimator for θ?
Solution
1. To compute the variance of θ̂₁, we use the pdf of max(x₁, ..., xₙ), which is g(x) = n·xⁿ⁻¹/θⁿ for 0 ≤ x ≤ θ.
Then
E[max(x₁, ..., xₙ)²] = ∫₀^θ x²·(n·xⁿ⁻¹/θⁿ) dx = nθ²/(n + 2). Also,
E[max(x₁, ..., xₙ)] = ∫₀^θ x·(n·xⁿ⁻¹/θⁿ) dx = nθ/(n + 1),
so Var[max(x₁, ..., xₙ)] = nθ²/(n + 2) − [nθ/(n + 1)]² = nθ²/[(n + 2)(n + 1)²].
Therefore,
Var(θ̂₁) = ((n + 1)/n)²·Var[max(x₁, ..., xₙ)] = θ²/[n(n + 2)].
2. For θ̂₂, since the xᵢ are independent, their variances are additive.
Each xᵢ is uniform on [0, θ], so Var(xᵢ) = ∫₀^θ x²·(1/θ) dx − (θ/2)² = θ²/3 − θ²/4 = θ²/12.
Thus, Var(θ̂₂) = (4/n²)[Var(x₁) + ... + Var(xₙ)] = (4/n²)·n·(θ²/12) = θ²/(3n).
3. We just calculated Var(θ̂₁) = θ²/[n(n + 2)] and Var(θ̂₂) = θ²/(3n).
For n = 1 these variances are the same (this is unsurprising because when n = 1 the estimators themselves are the same!).
For n > 1 we see that the variance of θ̂₁ is smaller, since 1/(n + 2) < 1/3, so θ̂₁ is more efficient.
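A short simulation (illustrative values of θ and n chosen by me) confirms the two variance formulas.

```python
import numpy as np

# Compare the variances of θ̂1 = ((n+1)/n)*max(x) and θ̂2 = (2/n)*sum(x)
# for Uniform(0, θ) data.
rng = np.random.default_rng(5)
theta, n, reps = 10.0, 5, 200_000

x = rng.uniform(0, theta, size=(reps, n))
t1 = (n + 1) / n * x.max(axis=1)
t2 = 2 * x.mean(axis=1)

print("Var(θ̂1) ≈", t1.var(), "  theory:", theta**2 / (n * (n + 2)))
print("Var(θ̂2) ≈", t2.var(), "  theory:", theta**2 / (3 * n))
```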
Cramér-Rao Lower Bound
The variance of any estimator is always bounded below (since it is by definition nonnegative). So it is
quite reasonable to ask whether, for a fixed estimation problem, there might be an optimal unbiased
estimator: namely, one of minimal variance.
This question turns out to be quite subtle, because we are not guaranteed that such an estimator
necessarily exists. For example, it could be the case that the possible variances of unbiased estimators form an open interval of the form (a, ∞) for some a ≥ 0.
Then there would be estimators whose variances approach the value a arbitrarily closely, but there is
none that actually achieves the lower bound value a.
There is, however, a lower bound on the possible values for the variance of an unbiased estimator:
Theorem (Cramér-Rao Inequality):
Suppose that p_X(x; θ) is a probability density (or mass) function that is differentiable in θ. Also suppose that the support of p_X, the set of values of x where p_X(x; θ) > 0, does not depend on the parameter θ. If T is any unbiased estimator of θ based on a random sample X₁, ..., Xₙ from p_X, then
Var(T) ≥ 1 / l(θ), where l(θ) = I(θ) = n·E[(∂ ln p_X(X; θ)/∂θ)²].
In the event that p_X is twice-differentiable in θ, it can be shown that l(θ) can also be calculated as
l(θ) = −n·E[∂² ln p_X(X; θ)/∂θ²].
A few remarks:
The proof of the Cramer-Rao inequality is rather technical (although not conceptually difficult),
so we will omit the precise details.
In practice, it is not always so easy to evaluate the lower bound in the Cramer-Rao inequality.
Furthermore, there does not always exist an unbiased estimator that actually achieves the Cramer-
Rao bound.
However, if we are able to find an unbiased estimator whose variance does achieve the Cramér-Rao bound, then the inequality guarantees that this estimator is the most efficient possible.
Example 2.3.5 Suppose that a coin with unknown probability θ of landing heads is flipped n times, yielding results X₁, ..., Xₙ (where we interpret heads as 1 and tails as 0).
Let θ̂ = (1/n)(x₁ + ... + xₙ).
1. Show that θ̂ is an unbiased estimator of θ.
2. Find the variance of θ̂.
3. Show that θ̂ has the minimum variance of all possible unbiased estimators of θ.
Solution
We first need to compute the expected value and variance of this estimator. Then we need to evaluate the lower bound in the Cramér-Rao inequality. The claim is that the given estimator actually achieves this lower bound.
1. Since X₁ + X₂ + ... + Xₙ is binomially distributed with parameters n and θ, its expected value is nθ. Then E(θ̂) = (1/n)·nθ = θ, so θ̂ is unbiased.
2. Since X₁ + X₂ + ... + Xₙ is binomially distributed with parameters n and θ, its variance is nθ(1 − θ).
Then the variance of θ̂ = (1/n)(x₁ + ... + xₙ) is
Var(θ̂) = (1/n²)·nθ(1 − θ) = θ(1 − θ)/n.
3. We compute the Cramér-Rao bound: if ℓ = ln p_X(X; θ) is the log-likelihood of a single observation, then
Var(θ̂) ≥ 1/l(θ), where l(θ) = −n·E[∂²ℓ/∂θ²].
Here, the likelihood of a single observation can be written as
p_X(x; θ) = θˣ(1 − θ)^(1−x) (it is θ if x = 1 and 1 − θ if x = 0), so that ℓ = x ln θ + (1 − x) ln(1 − θ).
Differentiating twice yields
∂²ℓ/∂θ² = −X/θ² − (1 − X)/(1 − θ)².
So, since E(X) = θ, the expected value is
E[∂²ℓ/∂θ²] = −E(X)/θ² − E(1 − X)/(1 − θ)² = −1/θ − 1/(1 − θ) = −1/[θ(1 − θ)].
From the above, the Cramér-Rao bound is Var(θ̂) ≥ θ(1 − θ)/n.
But we calculated before that for our unbiased estimator θ̂ = (1/n)(x₁ + ... + xₙ), we do in fact have Var(θ̂) = θ(1 − θ)/n.
Therefore, our estimator θ̂ achieves the Cramér-Rao bound, so it has the minimum variance of all possible unbiased estimators of θ, as claimed.
So, in fact, the obvious estimator is actually the best possible!
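As a numerical check (my own sketch, with arbitrary θ and n), the empirical variance of θ̂ = x̄ for Bernoulli data sits right at the Cramér-Rao bound θ(1 − θ)/n.

```python
import numpy as np

# The empirical variance of θ̂ = x̄ for Bernoulli(θ) data should match the
# Cramér-Rao bound θ(1-θ)/n.
rng = np.random.default_rng(6)
theta, n, reps = 0.35, 25, 300_000

theta_hat = rng.binomial(n, theta, size=reps) / n
print("Var(θ̂) ≈", theta_hat.var(), "  CRLB =", theta * (1 - theta) / n)
```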
Example 2.3.6: Show that the maximum-likelihood estimator θ̂ = (1/n)(x₁ + ... + xₙ) is the most efficient possible unbiased estimator of the mean θ of a normal distribution with unknown mean θ and known standard deviation σ.
Solution
We will show that this estimator achieves the Cramér-Rao bound.
For this, we first compute the log-pdf:
ℓ = ln p_X(X; θ) = −½ ln(2π) − ln σ − (X − θ)²/(2σ²).
Differentiating yields ∂ℓ/∂θ = (X − θ)/σ², and then ∂²ℓ/∂θ² = −1/σ².
So l(θ) = −n·E[∂²ℓ/∂θ²] = n/σ², and the Cramér-Rao bound says that Var(θ̂) ≥ σ²/n for any unbiased estimator θ̂.
For our estimator, since the Xᵢ are all independent and normally distributed with mean θ and standard deviation σ, we have
Var(θ̂) = (1/n²)[Var(X₁) + ... + Var(Xₙ)] = (1/n²)·nσ² = σ²/n.
Thus, the variance of our estimator θ̂ achieves the Cramér-Rao bound, meaning that it is the most efficient possible unbiased estimator of θ.
Definition 2.3.2: The efficiency of an unbiased estimator T for θ compares Var(T) with the Cramér-Rao Lower Bound (CRLB); it is denoted by e(T) and is given by e(T) = CRLB/Var(T), provided the probability function is smooth enough for the CRLB to apply. Note that for the UMVUE (an unbiased estimator attaining the CRLB) the efficiency is one.
Example 2.3.7: Consider a random sample from a Normal distribution with mean μ and variance σ². Note that X̄ is unbiased for μ and that S² is the unbiased estimator for σ². Since Var(X̄) = σ²/n equals the CRLB for μ, e(X̄) = 1; and since Var(S²) = 2σ⁴/(n − 1) while the CRLB for σ² is 2σ⁴/n, e(S²) = (n − 1)/n, which is less than one.
Definition 2.3.3 (relative efficiency): Let T₁ and T₂ be two unbiased estimators of θ with variances Var(T₁) and Var(T₂), respectively. The relative efficiency of T₁ with respect to T₂, denoted RE(T₁, T₂), is defined as the ratio RE(T₁, T₂) = Var(T₂)/Var(T₁).
Example 2.3.8: Let X₁, X₂, X₃, ..., Xₙ be a random sample of size n from the uniform distribution on the interval (0, θ). Also let T₁ = (n + 1)Y₁ and T₂ = ((n + 1)/n)Yₙ, where Y₁ and Yₙ are the smallest and largest order statistics, respectively. Then E(T₁) = θ and E(T₂) = θ; moreover Var(T₁) = nθ²/(n + 2) while Var(T₂) = θ²/[n(n + 2)], so RE(T₂, T₁) = Var(T₁)/Var(T₂) = n², and T₂ is far more efficient than T₁.
Summary
2.4 SUFFICIENCY
Introduction
In the lesson on Point Estimation, we derived estimators of various parameters using two
methods, namely, the method of maximum likelihood and the method of moments. The
estimators resulting from these two methods are typically intuitive estimators. It makes sense,
for example, that we would want to use the sample mean X̄ and sample variance S² to estimate the mean μ and variance σ² of a normal population. In this lesson, we'll learn how
to find statistics that summarize all of the information in a sample about the desired
parameter. Such statistics are called sufficient statistics, and hence the name of this lesson.
Definition 2.4.1: Let X₁, ..., Xₙ be a random sample from a probability distribution with unknown parameter θ. Then the statistic Y = u(X₁, ..., Xₙ) is said to be sufficient for θ if the conditional distribution of X₁, ..., Xₙ, given the statistic Y, does not depend on the parameter θ.
Example 2.4.1
If p is the probability that a subject likes Pepsi, let Xᵢ = 1 if the i-th subject likes Pepsi and Xᵢ = 0 otherwise, for i = 1, 2, 3, ..., n; that is, each Xᵢ is Bernoulli with success probability p. Suppose, in a random sample of n = 40 people, that Y = Σᵢ₌₁ⁿ Xᵢ people like Pepsi.
If we know the value of Y , the number of successes in n trials, can we gain any further
information about the parameter p by considering other functions of the data X 1 , ... , X n ? That
is, is Y sufficient for p?
Solution
The definition of sufficiency tells us that if the conditional distribution of X₁, ..., Xₙ, given the statistic Y, does not depend on p, then Y is a sufficient statistic for p. The conditional probability in question is
P(X₁ = x₁, ..., Xₙ = xₙ | Y = y) = P(X₁ = x₁, ..., Xₙ = xₙ, Y = y) / P(Y = y).   (**)
Now, for the sake of concreteness, suppose we were to observe a random sample of size n = 3 in which x₁ = 1, x₂ = 0, and x₃ = 1. In this case, P(X₁ = 1, X₂ = 0, X₃ = 1, Y = 1) = 0, because the sum of the data values is 1 + 0 + 1 = 2, but Y, which is defined to be the sum of the Xᵢ's, is 1. That is, because 2 ≠ 1, the event in the numerator of the starred (**) equation is an impossible event and therefore its probability is 0.
So, in general:
P(X₁ = x₁, ..., Xₙ = xₙ, Y = y) = 0 if Σᵢ xᵢ ≠ y, and
P(X₁ = x₁, ..., Xₙ = xₙ, Y = y) = p^y (1 − p)^(n−y) if Σᵢ xᵢ = y.
Now, the denominator in the starred (**) equation above is the binomial probability of getting exactly y successes in n trials with probability of success p. That is, the denominator is
P(Y = y) = C(n, y) p^y (1 − p)^(n−y) for y = 0, 1, 2, ..., n.
Putting the numerator and denominator together, we get, if Σᵢ xᵢ = y, that the conditional probability is:
P(X₁ = x₁, ..., Xₙ = xₙ | Y = y) = p^y (1 − p)^(n−y) / [C(n, y) p^y (1 − p)^(n−y)] = 1/C(n, y),
if Σᵢ xᵢ = y, and P(X₁ = x₁, ..., Xₙ = xₙ | Y = y) = 0 if Σᵢ xᵢ ≠ y.
We have just shown that the conditional distribution of X₁, ..., Xₙ given Y does not depend on p. Therefore, Y is indeed sufficient for p. That is, once the value of Y is known, no other function of the Xᵢ's will provide any additional information about the possible value of p.
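This can also be seen by simulation. The sketch below (my own illustration with n = 3 and several values of p) conditions on Y = 2 and tabulates which arrangement of the data occurred; the conditional frequencies are approximately 1/C(3, 2) = 1/3 regardless of p.

```python
import numpy as np
from collections import Counter

# Simulate Bernoulli(p) samples of size n = 3, keep only those with Y = 2,
# and tabulate the arrangements; the conditional distribution given Y does
# not depend on p.
rng = np.random.default_rng(7)
n, reps = 3, 300_000

for p in (0.2, 0.5, 0.8):
    x = rng.binomial(1, p, size=(reps, n))
    keep = x[x.sum(axis=1) == 2]                       # condition on Y = 2
    counts = Counter(tuple(int(v) for v in row) for row in keep)
    freqs = {k: round(v / len(keep), 3) for k, v in counts.items()}
    print(f"p = {p}:", freqs)
```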
Factorization Theorem 2.4.1
While the definition of sufficiency provided may make sense intuitively, it is not always all
that easy to find the conditional distribution of X₁, ..., Xₙ given Y, not to mention that we'd have to find that conditional distribution for every statistic we'd want to consider as a possible sufficient statistic! Therefore, using the formal definition of sufficiency as a way of identifying a
sufficient statistic for a parameter can often be a daunting road to follow. Therefore, a
theorem often referred to as the Factorization Theorem provides an easier method! We state it
here without proof.
Factorization Theorem: Let X₁, ..., Xₙ denote random variables with joint probability density function or joint probability mass function f(x₁, x₂, ..., xₙ; θ), which depends on the parameter θ. Then, the statistic Y = u(X₁, ..., Xₙ) is sufficient for θ if and only if the p.d.f. (or p.m.f.) can be factored into two components, that is:
f(x₁, ..., xₙ; θ) = φ[u(x₁, ..., xₙ); θ] · h(x₁, ..., xₙ), where:
φ[u(x₁, ..., xₙ); θ] is a function that depends on the data only through the function u(x₁, ..., xₙ), and
the function h(x₁, ..., xₙ) does not depend on the parameter θ.
Example 2.4.2
Let X₁, ..., Xₙ denote a random sample from a Poisson distribution with parameter θ > 0. Find a sufficient statistic for θ.
Solution
Because X₁, ..., Xₙ is a random sample, the joint probability mass function of X₁, ..., Xₙ is, by independence:
f(x₁, ..., xₙ; θ) = f(x₁; θ) × f(x₂; θ) × ... × f(xₙ; θ).
Inserting what we know to be the probability mass function of a Poisson random variable with parameter θ, the joint p.m.f. is therefore:
f(x₁, ..., xₙ; θ) = (e^(−θ) θ^(x₁)/x₁!) × (e^(−θ) θ^(x₂)/x₂!) × ... × (e^(−θ) θ^(xₙ)/xₙ!).
Now, simplifying, by adding up all n of the θ's in the exponents, as well as all of the xᵢ's, we get:
f(x₁, ..., xₙ; θ) = (e^(−nθ) θ^(Σxᵢ)) × 1/(x₁!x₂!···xₙ!).
We just factored the joint p.m.f. into two functions, one (φ) being only a function of the statistic Y = Σᵢ₌₁ⁿ Xᵢ and the other (h) not depending on the parameter θ.
Therefore, the Factorization Theorem tells us that Y = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for θ. We can also write the joint p.m.f. as
f(x₁, ..., xₙ; θ) = (e^(−nθ) θ^(n x̄)) × 1/(x₁!x₂!···xₙ!).
Therefore, the Factorization Theorem tells us that Y = X̄ is also a sufficient statistic for θ!
If you think about it, it makes sense that Y = X̄ and Y = Σᵢ₌₁ⁿ Xᵢ are both sufficient statistics, because if we know Y = X̄, we can easily find Σᵢ₌₁ⁿ Xᵢ. And, if we know Y = Σᵢ₌₁ⁿ Xᵢ, we can easily find Y = X̄.
Example 2.4.3 Let X₁, ..., Xₙ be a random sample from a normal distribution with mean μ and variance 1. Find a sufficient statistic for μ.
Solution
Because X₁, ..., Xₙ is a random sample, the joint probability density function of X₁, ..., Xₙ is, by independence:
f(x₁, ..., xₙ; μ) = f(x₁; μ) × f(x₂; μ) × ... × f(xₙ; μ).
Inserting what we know to be the probability density function of a normal random variable with mean μ and variance 1, the joint p.d.f. is:
f(x₁, x₂, ..., xₙ; μ) = (2π)^(−1/2) exp[−½(x₁ − μ)²] × (2π)^(−1/2) exp[−½(x₂ − μ)²] × ... × (2π)^(−1/2) exp[−½(xₙ − μ)²].
Collecting like terms, we get:
f(x₁, x₂, ..., xₙ; μ) = (2π)^(−n/2) exp[−½ Σᵢ₌₁ⁿ (xᵢ − μ)²].
A trick to making the factoring of the joint p.d.f. an easier task is to add 0 to the quantity in parentheses in the summation. That is:
f(x₁, x₂, ..., xₙ; μ) = (2π)^(−n/2) exp{−½ Σᵢ₌₁ⁿ [(xᵢ − x̄) + (x̄ − μ)]²}
= (2π)^(−n/2) exp{−½ [Σᵢ₌₁ⁿ (xᵢ − x̄)² + 2(x̄ − μ)Σᵢ₌₁ⁿ (xᵢ − x̄) + Σᵢ₌₁ⁿ (x̄ − μ)²]}.
But the middle term in the exponent is 0, and the last term, because it doesn't depend on the index i, can be added up n times:
f(x₁, x₂, ..., xₙ; μ) = exp[−(n/2)(x̄ − μ)²] × (2π)^(−n/2) exp[−½ Σᵢ₌₁ⁿ (xᵢ − x̄)²].
In summary, we have factored the joint p.d.f. into two functions, one (φ) being only a function of the statistic x̄ and the other (h) not depending on the parameter μ.
Therefore, the Factorization Theorem tells us that Y = X̄ is a sufficient statistic for μ. Now, Y = X̄³ is also sufficient for μ, because if we are given the value of X̄³, we can easily get the value of X̄ through the one-to-one function w = y^(1/3). That is:
X̄ = (X̄³)^(1/3).
On the other hand, Y = X̄² is not a sufficient statistic for μ, because the squaring is not a one-to-one function. That is, if we are given the value of X̄², the inverse function
w = y^(1/2)
does not tell us whether X̄ was the positive or the negative square root, so we cannot recover X̄ itself.
We're getting so good at this, let's take a look at one more example!
Example 2.4.4
Let X₁, ..., Xₙ be a random sample from an exponential distribution with parameter θ. Find a sufficient statistic for θ.
Solution
Because X₁, ..., Xₙ is a random sample, the joint probability density function of X₁, ..., Xₙ is, by independence:
f(x₁, ..., xₙ; θ) = f(x₁; θ) × f(x₂; θ) × ... × f(xₙ; θ)
= (1/θ) exp(−x₁/θ) × (1/θ) exp(−x₂/θ) × ... × (1/θ) exp(−xₙ/θ).
Now, simplifying, by adding up all n of the θ's and the xᵢ's in the exponents, we get:
f(x₁, x₂, ..., xₙ; θ) = (1/θⁿ) exp(−Σᵢ₌₁ⁿ xᵢ/θ) × 1.
We have again factored the joint p.d.f. into two functions, one (φ) being only a function of the statistic Y = Σᵢ₌₁ⁿ Xᵢ and the other (h = 1) not depending on the parameter θ.
Therefore, the Factorization Theorem tells us that Y = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for θ.
And, since X̄ is a one-to-one function of Y = Σᵢ₌₁ⁿ Xᵢ, it implies that Y = X̄ is also a sufficient statistic for θ.
Exponential Form
You might not have noticed that in all of the examples we have considered so far in this lesson, every p.d.f. or p.m.f. could be written in what is often called exponential form, that is:
f(x; θ) = exp[K(x)p(θ) + S(x) + q(θ)].
First, we had Bernoulli random variables, with p.m.f.
f(x; p) = pˣ(1 − p)^(1−x).
Writing
f(x; p) = exp{ln[pˣ(1 − p)^(1−x)]}
and using laws of logarithms, you get
f(x; p) = exp{x ln[p/(1 − p)] + ln(1 − p)},
which is of exponential form with K(x) = x, p(p) = ln[p/(1 − p)], S(x) = 0, and q(p) = ln(1 − p).
Example 2.4.5
Write the Poisson p.m.f.
f(x; θ) = e^(−θ) θˣ / x!
in exponential form. Taking logarithms,
f(x; θ) = exp[x ln θ − ln(x!) − θ],
with K(x) = x, p(θ) = ln θ, S(x) = −ln(x!), and q(θ) = −θ.
Likewise, the normal p.d.f. (with known variance) can be written in exponential form. In general, a p.d.f. or p.m.f. is said to be of exponential form if it can be written as f(x; θ) = exp[K(x)p(θ) + S(x) + q(θ)] with:
1. K(x) and S(x) being functions only of x,
2. p(θ) and q(θ) being functions only of the parameter θ, and
3. the support of x being free of the parameter θ.
Example 2.4.6
Similarly, exponential random variables have p.d.f. that can be written in exponential form as
f(x; θ) = (1/θ) exp(−x/θ) = exp[x(−1/θ) − ln θ],
with K(x) = x, p(θ) = −1/θ, S(x) = 0, and q(θ) = −ln θ.
Writing p.d.f.s and p.m.f.s in exponential form provides you a third way of identifying
sufficient statistics for our parameters.
Let X₁, X₂, ..., Xₙ be a random sample from a distribution with a p.d.f. or p.m.f. of the exponential form f(x; θ) = exp[K(x)p(θ) + S(x) + q(θ)], with a support that does not depend on θ. Then, the statistic Σᵢ₌₁ⁿ K(Xᵢ) is sufficient for θ.
Exercise
Let X₁, X₂, ..., Xₙ be a random sample from a geometric distribution with parameter p. Find a sufficient statistic for p.
Remarks:
(i) The order statistics form a joint sufficient statistic for the parameter, for any distribution.
(ii) A set of statistics (T₁, T₂, ..., Tₖ) is called minimal sufficient if the statistics are jointly sufficient for the parameter and are a function of every other set of jointly sufficient statistics.
Confidence intervals
Confidence intervals (CIs) provide a method of quantifying the uncertainty in an estimator θ̂ when we wish to estimate an unknown parameter θ. We want to find an interval (A, B) that we think has high probability of containing θ.
Example 1: Suppose that Xₙ = (X₁, ..., Xₙ) is a random sample from a N(μ, σ²) distribution with σ² known, so that X̄ₙ ~ N(μ, σ²/n).
The standardized version of the sample mean follows N(0, 1) and can therefore act as a pivot. In other words, construct
ψ(Xₙ, μ) = √n(X̄ₙ − μ)/σ ~ N(0, 1)
for every value of μ. With z_α denoting the upper α-th quantile of N(0, 1) (i.e., P(Z > z_α) = α), we have
P_μ( −z_{α/2} ≤ √n(X̄ − μ)/σ ≤ z_{α/2} ) = 1 − α.
From the above display we can find limits for μ such that the above inequalities are simultaneously satisfied. On doing the algebra, we get:
P_μ( X̄ − z_{α/2}·σ/√n ≤ μ ≤ X̄ + z_{α/2}·σ/√n ) = 1 − α.
Thus our level (1 − α) CI for μ is given by
[ X̄ − z_{α/2}·σ/√n , X̄ + z_{α/2}·σ/√n ].
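For concreteness, a minimal Python sketch of this interval (simulated data; all numbers are illustrative, and scipy is assumed to be available) is given below.

```python
import numpy as np
from scipy import stats

# 95% CI for μ when σ is known: X̄ ± z_{α/2} σ/√n.
rng = np.random.default_rng(8)
mu_true, sigma, n, alpha = 10.0, 3.0, 40, 0.05

x = rng.normal(mu_true, sigma, size=n)
z = stats.norm.ppf(1 - alpha / 2)
half_width = z * sigma / np.sqrt(n)
print("95% CI for μ:", (x.mean() - half_width, x.mean() + half_width))
```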
Often a standard method of constructing CIs is the following method of pivots, which we describe below.
(1) Construct a function of the data Xₙ and of g(θ), say ψ(Xₙ, g(θ)), such that the distribution of this random variable under parameter value θ does not depend on θ and is known. Such a ψ is called a pivot.
(2) Let G denote the distribution function of the pivot. The idea now is to get a range of plausible values of the pivot. The level of confidence 1 − α is to be used to get the appropriate range.
This can be done in a variety of ways, but the following is standard. Denote by q(G, β) the β-th quantile of G; the pivot then lies between q(G, α/2) and q(G, 1 − α/2) with probability 1 − α, and inverting these inequalities in terms of g(θ) gives the CI.
In Example 1, σ was assumed known; more realistically, one replaces σ by s, where s² is the natural estimate of σ² introduced before. So, set:
ψ(Xₙ, μ) = √n(X̄ − μ)/s.
This only depends on the data and on g(μ) = μ. We claim that this is indeed a pivot. To see this, write
√n(X̄ − μ)/s = [ √n(X̄ − μ)/σ ] / √(s²/σ²).
The numerator on the extreme right of the above display follows N(0, 1), and the denominator is independent of the numerator and is the square root of a χ²_{n−1} random variable divided by its degrees of freedom.
Thus, G here is the t_{n−1} distribution, and we can choose the quantiles to be q(t_{n−1}, α/2) = −q(t_{n−1}, 1 − α/2) and q(t_{n−1}, 1 − α/2).
It follows that
P_{μ,σ²}( −q(t_{n−1}, 1 − α/2) ≤ √n(X̄ − μ)/s ≤ q(t_{n−1}, 1 − α/2) ) = 1 − α.
As with Example 1, direct algebraic manipulations show that this is the same as the statement:
P_{μ,σ²}( X̄ − q(t_{n−1}, 1 − α/2)·s/√n ≤ μ ≤ X̄ + q(t_{n−1}, 1 − α/2)·s/√n ) = 1 − α.
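The corresponding t-interval can be computed in the same way (again a sketch with simulated data, with scipy assumed available).

```python
import numpy as np
from scipy import stats

# 95% t-interval for μ when σ is unknown: X̄ ± t_{n-1, 1-α/2} * s/√n.
rng = np.random.default_rng(9)
mu_true, sigma, n, alpha = 10.0, 3.0, 40, 0.05

x = rng.normal(mu_true, sigma, size=n)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
print("95% t-interval for μ:", (x.mean() - half_width, x.mean() + half_width))
```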
Now suppose more generally that X₁, ..., Xₙ are iid from some distribution F (not necessarily normal) with E(X₁) = μ and Var(X₁) = σ². We are interested in constructing an approximate level (1 − α) CI for μ. By the Central Limit Theorem,
√n(X̄ − μ)/σ ~ approximately N(0, 1).
If σ is known, the above quantity is an approximate pivot and, following Example 1, we can therefore write
P_μ( −z_{α/2} ≤ √n(X̄ − μ)/σ ≤ z_{α/2} ) ≈ 1 − α.
As before, this translates to
P_μ( X̄ − z_{α/2}·σ/√n ≤ μ ≤ X̄ + z_{α/2}·σ/√n ) ≈ 1 − α.
This gives an approximate level (1 − α) CI for μ when σ is known.
The approximation will improve as the sample size n increases. Note that the true coverage of the above CI may be different from (1 − α) and can depend heavily on the nature of F and the sample size n.
Realistically, however, σ is unknown and is replaced by s. Since we are dealing with large sample sizes, s is with very high probability close to σ, and the interval
[ X̄ − z_{α/2}·s/√n , X̄ + z_{α/2}·s/√n ]
still remains an approximate level (1 − α) CI.
Interpretation of confidence intervals: Let (A, B) be a level (1 − α) confidence interval for a parameter θ. Let (a, b) be the observed value of the interval. It is NOT correct to say that θ lies in the interval (a, b) with probability 1 − α. It is true that θ will lie in the random interval having endpoints A(X₁, ..., Xₙ) and B(X₁, ..., Xₙ) with probability 1 − α.
After observing the specific values A(X₁, ..., Xₙ) = a and B(X₁, ..., Xₙ) = b, it is not possible to assign a probability to the event that θ lies in the specific interval (a, b) without regarding θ as a random variable. We usually say that there is 100(1 − α)% confidence that θ lies in the interval (a, b).
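This interpretation can be illustrated by simulation: the confidence level describes the long-run behaviour of the procedure, not any single interval. The sketch below (my own illustration; all numbers are arbitrary) repeats the t-interval construction many times and reports the fraction of intervals that cover the true mean.

```python
import numpy as np
from scipy import stats

# The "95%" refers to the procedure: about 95% of the intervals computed from
# repeated samples cover the true μ.
rng = np.random.default_rng(10)
mu_true, sigma, n, alpha, reps = 10.0, 3.0, 40, 0.05, 20_000

x = rng.normal(mu_true, sigma, size=(reps, n))
half = stats.t.ppf(1 - alpha / 2, df=n - 1) * x.std(axis=1, ddof=1) / np.sqrt(n)
covered = np.abs(x.mean(axis=1) - mu_true) <= half
print("empirical coverage:", covered.mean())   # should be close to 0.95
```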