Unit 2


2.0 Properties of estimators

Properties of estimators are all consequences of their sampling distributions. Most of the time, the full sampling distribution of $\hat\theta$ is not available; therefore, we focus on properties that do not require complete knowledge of the sampling distribution.

2.1 Unbiasedness

The first, and probably the simplest, property is called unbiasedness. In words, an estimator $\hat\theta$ is unbiased if, when applied to many different samples from F, $\hat\theta$ equals the true parameter θ on average. Equivalently, unbiasedness means the sampling distribution of $\hat\theta$ is, in some sense, centered around θ.

Definition 2.1.1. Let $X_1, X_2, \dots, X_n$ be a random sample of size n from $f(x;\theta)$. An estimator T of θ is said to be an unbiased estimator of θ if and only if (iff) $E(T) = \theta$ for all values of θ in Θ, where Θ is the parameter space.

That is, no matter the actual value of θ, if we apply $\hat\theta = \hat\theta(X_1, X_2, \dots, X_n)$ to many datasets $X_1, X_2, \dots, X_n$ sampled from F, then the average of these $\hat\theta$ values will equal θ; in other words, $E(\hat\theta) = \theta$ for all θ. This is clearly not an unreasonable property, and a lot of work in mathematical statistics has focused on unbiased estimation.

Example 2.1.1. Let $X_1, X_2, \dots, X_n$ be an iid sample of size n from a distribution F having mean μ and variance σ². This distribution could be normal, but it need not be. Consider $\hat\mu = \bar X$, the sample mean, and $\hat\sigma^2 = S^2$, the sample variance. Then $\hat\mu$ and $\hat\sigma^2$ are unbiased estimators of μ and σ², respectively.

Under the normal model, the MLE for μ is $\bar X$ and the MLE for σ² is $\tilde\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$. Since $E(\bar X) = \mu$, $\bar X$ is an unbiased estimator of μ, while $\tilde\sigma^2$ is a biased estimator of σ². Note that $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$ is an unbiased estimator of σ².

Proof

Proof: Let $X_1, \dots, X_n$ be independently selected random observations drawn from a population with mean μ and variance σ². Then
$$E(\bar X) = E\Big(\frac{x_1 + \dots + x_n}{n}\Big) = \frac{1}{n}E(x_1 + \dots + x_n) = \frac{1}{n}\big[E(x_1) + \dots + E(x_n)\big] = \frac{1}{n}[\mu + \dots + \mu] = \frac{1}{n}\,n\mu = \mu.$$

To show that the sample variance is an unbiased estimator of the population variance, i.e., $E(S^2) = \sigma^2$:

Let $X_1, \dots, X_n$ be n independent observations from a population with mean μ and variance σ².

Proof

We need to use the following relations:
$$E(X_i) = \mu, \qquad \operatorname{Var}(X_i) = \sigma^2, \qquad E\Big(\sum X_i\Big) = \sum E(X_i), \qquad E(cX) = cE(X),$$
$$\operatorname{Var}(X) = E(X^2) - [E(X)]^2 \;\Rightarrow\; E(X_i^2) = \sigma^2 + \mu^2,$$
$$\operatorname{Var}(\bar X) = E(\bar X^2) - [E(\bar X)]^2 \;\Rightarrow\; E(\bar X^2) = \frac{\sigma^2}{n} + \mu^2.$$

What we want is to show that $E(S^2) = \sigma^2$. To do this we set
$$E(S^2) = E\left[\frac{\sum (X_i - \bar X)^2}{n-1}\right],$$
so we first take
$$E\Big[\sum (X_i - \bar X)^2\Big] = E\Big[\sum \big(X_i^2 - 2X_i\bar X + \bar X^2\big)\Big] = E\Big[\sum X_i^2 - n\bar X^2\Big] = \sum E(X_i^2) - nE(\bar X^2).$$
Using $E(X_i^2) = \sigma^2 + \mu^2$ and $E(\bar X^2) = \frac{\sigma^2}{n} + \mu^2$, this becomes
$$E\Big[\sum (X_i - \bar X)^2\Big] = n(\sigma^2 + \mu^2) - n\Big(\frac{\sigma^2}{n} + \mu^2\Big) = n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2 = (n-1)\sigma^2.$$
So we write
$$E(S^2) = E\left[\frac{\sum (X_i - \bar X)^2}{n-1}\right] = \frac{E\big[\sum (X_i - \bar X)^2\big]}{n-1} = \frac{(n-1)\sigma^2}{n-1} = \sigma^2.$$

Therefore, the sample variance is an unbiased estimator of the population variance, regardless
of the model.
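As an illustrative check (not part of the proof above), the following Python sketch approximates $E(\bar X)$ and $E(S^2)$ by Monte Carlo; the population (an exponential with mean 2, so μ = 2 and σ² = 4), the sample size, and the replication count are arbitrary choices for this demonstration.

```python
import numpy as np

# Monte Carlo check that E(Xbar) = mu and E(S^2) = sigma^2 for a non-normal population.
# Population: exponential with mean mu = 2, so sigma^2 = 4 (arbitrary illustrative choice).
rng = np.random.default_rng(0)
mu, sigma2, n, reps = 2.0, 4.0, 10, 200_000

samples = rng.exponential(scale=mu, size=(reps, n))
xbar = samples.mean(axis=1)              # sample means, one per replication
s2 = samples.var(axis=1, ddof=1)         # sample variances with divisor n - 1

print("average of Xbar:", xbar.mean(), "(target:", mu, ")")
print("average of S^2 :", s2.mean(), "(target:", sigma2, ")")
```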

While unbiasedness is a nice property for an estimator to have, it doesn’t carry too much
weight. Specifically, an estimator can be unbiased but otherwise very poor. For an extreme
example, suppose that $P_\theta\{\hat\theta = \theta - 10^5\} = \tfrac12 = P_\theta\{\hat\theta = \theta + 10^5\}$. In this case $E_\theta(\hat\theta) = \theta$, but $\hat\theta$ is

always very far away from θ. There is also a well-known phenomenon (bias–variance trade-
off) which says that often allowing the bias to be non-zero will improve on estimation
accuracy; more on this below. The following example highlights some of the problems of
focusing primarily on the unbiasedness property.
Example 2.1.2. Let X be a single observation from a Pois(θ) distribution. Suppose the goal is to estimate $\eta = e^{-2\theta}$, not θ itself. We know that $\hat\theta = X$ is an unbiased estimator of θ. However, the natural estimator $e^{-2X}$ is not an unbiased estimator of $e^{-2\theta}$. Consider instead $\hat\eta = (-1)^X$. This estimator is unbiased:
$$E\big[(-1)^X\big] = \sum_{x=0}^{\infty}(-1)^x\,\frac{e^{-\theta}\theta^x}{x!} = e^{-\theta}\sum_{x=0}^{\infty}\frac{(-\theta)^x}{x!} = e^{-\theta}e^{-\theta} = e^{-2\theta}.$$
In fact, it can even be shown that $(-1)^X$ is the "best" of all unbiased estimators; cf. the Lehmann–Scheffé theorem. But even though it is unbiased, it can only take the values ±1, so, depending on θ, $(-1)^X$ may never be close to $e^{-2\theta}$.

Example 2.1.3. If $X_1, \dots, X_n$ are Bernoulli random variables with parameter p, then $\hat p = \bar X = \frac{1}{n}\sum X_i$ is the maximum likelihood estimator (MLE) of p. Is the MLE an unbiased estimator of p?

Solution: Recall that if X is a Bernoulli random variable with parameter p, then $E(X) = p$. Therefore:
$$E(\hat p) = E\Big(\frac{1}{n}\sum X_i\Big) = \frac{1}{n}\sum E(X_i) = \frac{1}{n}\sum p = \frac{1}{n}(np) = p.$$

The first equality holds because we've merely replaced p̂ with its definition. The second
equality holds by the rules of expectation for a linear combination. The third equality holds
because E ( X )  p . The fourth equality holds because when you add the value p up n times,
you get np . And, of course, the last equality is simple algebra.

In summary, we have shown that $E(\hat p) = p$. Therefore $\hat p$, the maximum likelihood estimator, is an unbiased estimator of p.

Example 2.1.4. Let $X_1, X_2, \dots, X_n$ be iid Ber(θ) and suppose we want to estimate $\eta = \dfrac{\theta}{1-\theta}$, the so-called odds. Suppose $\hat\eta$ is an unbiased estimator of η, so that
$$E_\theta(\hat\eta) = \frac{\theta}{1-\theta}\quad\text{for all }\theta,$$
or, equivalently, $(1-\theta)E_\theta(\hat\eta) - \theta = 0$ for all θ. Here the joint PMF of $(X_1, \dots, X_n)$ is
$$f_\theta(x_1, \dots, x_n) = \theta^{x_1+x_2+\dots+x_n}(1-\theta)^{n-(x_1+\dots+x_n)}.$$
Writing out $E_\theta(\hat\eta)$ as a weighted average with weights given by $f_\theta(x_1, \dots, x_n)$, we get
$$(1-\theta)\sum_{\text{all }x_1,\dots,x_n}\hat\eta(x)\,\theta^{x_1+\dots+x_n}(1-\theta)^{n-(x_1+\dots+x_n)} - \theta = 0\quad\text{for all }\theta.$$
The quantity on the left-hand side is a polynomial in θ of degree n+1. By the Fundamental Theorem of Algebra, the equation can have at most n+1 real roots. However, unbiasedness requires that there be infinitely many roots. This contradicts the fundamental theorem, so we must conclude that there are no unbiased estimators of η.

Another useful though perhaps crude measure of closeness of an estimator is the mean-square
error (MSE) of the estimator.

Definition 2.1.2: Let T be an estimator of θ. The mean-square error of T, denoted by MSE(T), is given by $\mathrm{MSE}(T) = E\big[(T-\theta)^2\big]$; it is a function of θ.

Remark: $\mathrm{MSE}(T) = \operatorname{Var}(T) + [b(T)]^2$, where $b(T) = E(T) - \theta$ is the bias of T. Thus if an estimator is unbiased then its mean-squared error is equal to its variance.

Example 2.1.5:

a) Suppose $X_1, \dots, X_n \sim U(0, \theta)$. Recall that $T = Y_n = \max(X_1, \dots, X_n)$ and $W = \dfrac{n+1}{n}Y_n$ are two estimators of θ, with T being biased and W unbiased.
$$\mathrm{MSE}(T) = \operatorname{Var}(T) + [E(T) - \theta]^2 \qquad (*)$$
$$\mathrm{MSE}(W) = \operatorname{Var}(W), \qquad (**)\ \text{since W is unbiased.}$$
Using the pdf of $Y_n$, namely $g(y) = ny^{n-1}/\theta^n$ for $0 < y < \theta$, we have $E(T) = \dfrac{n}{n+1}\theta$ and $\operatorname{Var}(T) = \dfrac{n\theta^2}{(n+1)^2(n+2)}$. Substituting the expressions for Var(T) and E(T) in (*) and (**), after simplification we get
$$\mathrm{MSE}(T) = \frac{2\theta^2}{(n+1)(n+2)}, \qquad \mathrm{MSE}(W) = \operatorname{Var}(W) = \frac{\theta^2}{n(n+2)}.$$
Clearly $\mathrm{MSE}(W) < \mathrm{MSE}(T)$ for any n greater than one, and the two are equal for n = 1. Therefore W is better than T as an estimator of θ in the MSE sense.

2.2 Consistency

Another reasonable property is that the estimator $\hat\theta = \hat\theta_n$, which depends on the sample size n through its dependence on $X_1, X_2, \dots, X_n$, should get close to the true θ as n gets larger and larger. An estimator $\hat\theta_n(x)$ is consistent for a parameter θ if the probability of a sampling error of any magnitude converges to 0 as the sample size n increases to infinity:
$$P\big(|\hat\theta_n - \theta| \ge \varepsilon\big) \to 0 \text{ as } n \to \infty, \text{ for every } \varepsilon > 0.$$

The Usual Setup is that we are interested in the value of some parameter θ that describes a
feature of a population. We draw a random sample from the population, X 1 , X 2 , ..., X n , and

have an estimator which is a function of the sample: $\hat\theta = \hat\theta(X_1, X_2, \dots, X_n)$.

The idea is that we'd like $\hat\theta$ to get closer and closer to θ as we draw larger and larger samples.

Definition 2.2.1: Statistical Consistency

An estimator $\hat\theta_n$ is said to be a consistent estimator of θ if, for any positive ε,
$$\lim_{n\to\infty} P\big(|\hat\theta_n - \theta| < \varepsilon\big) = 1, \quad\text{or, equivalently,}\quad \lim_{n\to\infty} P\big(|\hat\theta_n - \theta| \ge \varepsilon\big) = 0.$$
We say that $\hat\theta_n$ converges in probability to θ and we write $\hat\theta_n \xrightarrow{\,p\,} \theta$.

2
The sample mean is consistent for  since var( X )  and
n
2

P X     var( X )
 2

 2
n  0 as n  

Where the inequality is from Chebyshev’s inequality


To make this precise, recall the following definition.

Definition (convergence in probability). Let T and $\{T_n : n \ge 1\}$ be random variables on a common sample space. Then $T_n$ converges to T in probability if, for any ε > 0, $\lim_{n\to\infty} P\big(|T_n - T| \ge \varepsilon\big) = 0$.

The law of large numbers (LLN): if $X_1, X_2, \dots, X_n$ are iid with mean μ and finite variance σ², then $\bar X_n = n^{-1}\sum_{i=1}^n X_i$ converges in probability to μ. The LLN is a powerful result and will be used throughout the course. Two useful tools for proving convergence in probability are the inequalities of Markov and Chebyshev.

 Markov's inequality. Let X be a positive random variable, i.e., P(X > 0) = 1. Then, for any ε > 0, $P(X > \varepsilon) \le \varepsilon^{-1}E(X)$.
 Chebyshev's inequality. Let X be a random variable with mean μ and variance σ². Then, for any ε > 0, $P\{|X - \mu| > \varepsilon\} \le \sigma^2/\varepsilon^2$.
It is through convergence in probability that we can say that an estimator $\hat\theta = \hat\theta_n$ gets close to the estimand θ as n gets large.

Definition 2.2.2. An estimator $\hat\theta_n$ of θ is consistent if $\hat\theta_n \to \theta$ in probability. A rough way to understand consistency of an estimator $\hat\theta_n$ of θ is that the sampling distribution of $\hat\theta_n$ gets more and more concentrated around θ as n → ∞. The following example demonstrates both a theoretical verification of consistency and a visual confirmation via Monte Carlo.
Example 2.2.1: Suppose $Y \sim \mathrm{Bin}(n, \theta)$, where θ is the probability of success. Show that $\hat\theta = Y/n$ is a consistent estimator of the population parameter θ.

Solution. We first show that $\hat\theta$ is an unbiased estimator of θ. Since
$$E(\hat\theta) = E\Big(\frac{Y}{n}\Big) = \frac{1}{n}E(Y) = \frac{1}{n}\,n\theta = \theta,$$
$\hat\theta$ is an unbiased estimator of θ.

The variance of the estimator is
$$\operatorname{Var}(\hat\theta) = \operatorname{Var}\Big(\frac{Y}{n}\Big) = \frac{1}{n^2}\operatorname{Var}(Y) = \frac{1}{n^2}\,n\theta(1-\theta) = \frac{\theta(1-\theta)}{n}.$$

Thus the standard error is
$$SE(\hat\theta) = \sqrt{\operatorname{Var}(\hat\theta)} = \sqrt{\frac{\theta(1-\theta)}{n}}.$$

By Chebyshev's inequality,
$$P\big(|\hat\theta - \theta| \ge \varepsilon\big) \le \frac{\operatorname{Var}(\hat\theta)}{\varepsilon^2} = \frac{\theta(1-\theta)}{n\varepsilon^2},$$
which tends to zero as $n \to \infty$. Hence, the estimator is consistent for θ.

Example 2.2.2. Recall the setup of Example 2.1.1. It follows immediately from the LLN that $\hat\mu_n = \bar X$ is a consistent estimator of the mean μ. Moreover, the sample variance $\hat\sigma_n^2 = S^2$ is also a consistent estimator of the variance σ². To see this, recall that
$$\hat\sigma_n^2 = \frac{n}{n-1}\left\{\frac{1}{n}\sum_{i=1}^n x_i^2 - \bar X^2\right\}.$$
The factor $\frac{n}{n-1}$ converges to 1; the first term in the braces converges in probability to $\sigma^2 + \mu^2$ by the LLN applied to the $X_i^2$'s; the second term in the braces converges in probability to $\mu^2$ by the LLN (see, also, the Continuous Mapping Theorem below). Putting everything together, we find that $\hat\sigma_n^2 \to \sigma^2$ in probability, making it a consistent estimator.

To see this property visually, suppose that the sample originates from a Poisson distribution with mean θ = 3. You can modify the algorithm to simulate the sampling distribution of $\hat\theta_n = \hat\sigma_n^2$ for any n. The results for n ∈ {4, 16, 54} are summarized in Figure 1. Notice that as n increases, the sampling distributions become more concentrated around θ = 3.

Unbiased estimators generally are not invariant under transformations [i.e., in general, if $\hat\theta$ is unbiased for θ, then $g(\hat\theta)$ is not unbiased for g(θ)], but consistent estimators do have such a property, a consequence of the so-called Continuous Mapping Theorem.
Theorem 2.2.1 (Continuous Mapping Theorem). Let g be a continuous function on Θ. If $\hat\theta_n$ is consistent for θ, then $g(\hat\theta_n)$ is consistent for g(θ).


Proof. Fix a particular θ value. Since g is a continuous function on Θ, it is continuous at this particular θ. For any ε > 0, there exists a δ > 0 (depending on ε and θ) such that $|g(\hat\theta_n) - g(\theta)| > \varepsilon$ implies $|\hat\theta_n - \theta| > \delta$. Then the probability of the event on the left is no more than the probability of the event on the right, and this latter probability vanishes as n → ∞ by assumption. Therefore
$$\lim_{n\to\infty} P\big(|g(\hat\theta_n) - g(\theta)| > \varepsilon\big) = 0.$$
Since ε was arbitrary, the proof is complete.


Figure 1: Plots of the sampling distribution of $\hat\theta_n$, the sample variance, for several values of n in the Pois(θ) problem with θ = 3.
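A Monte Carlo of the kind summarized in Figure 1 can be sketched as follows (the replication count and the cutoff 0.5 are arbitrary choices); rather than plotting histograms, it reports how concentrated the sampling distribution of $S^2$ is around θ = 3 for each n.

```python
import numpy as np

# Sampling distribution of the sample variance S^2 when X_i ~ Pois(theta) with theta = 3.
# As n grows, the distribution concentrates around theta (consistency).
rng = np.random.default_rng(2)
theta, reps = 3.0, 100_000

for n in (4, 16, 54):
    s2 = rng.poisson(theta, size=(reps, n)).var(axis=1, ddof=1)
    close = np.mean(np.abs(s2 - theta) < 0.5)   # estimated P(|S^2 - theta| < 0.5)
    print(f"n = {n:3d}: mean = {s2.mean():.3f}, sd = {s2.std():.3f}, P(|S^2 - 3| < 0.5) = {close:.3f}")
```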
Example 2.2.3. Let $X_1, X_2, \dots, X_n$ be iid Pois(θ). Since θ is both the mean and the variance of the Poisson distribution, it follows that both $\hat\theta_n = \bar X$ and $\tilde\theta_n = S^2$ are unbiased and consistent for θ. The two estimators can also be combined into a new estimator
$$\check\theta_n = \big(\bar X\, S^2\big)^{1/2}.$$
Define the function $g(x_1, x_2) = (x_1 x_2)^{1/2}$. Clearly g is continuous (why?). Since the pair $(\hat\theta_n, \tilde\theta_n)$ is a consistent estimator of (θ, θ), it follows from the Continuous Mapping Theorem that $\check\theta_n = g(\hat\theta_n, \tilde\theta_n)$ is a consistent estimator of $\theta = g(\theta, \theta)$. Like unbiasedness, consistency is a nice property for an estimator to have. But consistency alone is not enough to make an estimator a good one.
Example 2.2.4. Let $X_1, X_2, \dots, X_n$ be iid N(θ, 1). Consider the estimator
$$\hat\theta_n = \begin{cases} 10^7 & \text{if } n < 10^{750},\\ \bar X & \text{otherwise.}\end{cases}$$
Show that $\hat\theta_n$ is a consistent estimator.

Solution
Let $N = 10^{750}$. Although N is very large, it is ultimately finite and can have no effect on the limit. To see this, fix ε > 0 and define
$$a_n = P\big(|\hat\theta_n - \theta| > \varepsilon\big) \quad\text{and}\quad b_n = P\big(|\bar X_n - \theta| > \varepsilon\big).$$
Since $b_n \to 0$ by the LLN, and $a_n = b_n$ for all n ≥ N, it follows that $a_n \to 0$ and, hence, $\hat\theta_n$ is consistent. However, for any reasonable application, where the sample size is finite, estimating θ by the constant $10^7$ is an absurd choice.
Definition 2.2.3. An estimator T is said to be consistent for θ if it converges in probability to θ; that is, if for all ε > 0 and δ > 0 there exists N (which depends on both ε and δ) such that $P\big(|T - \theta| > \varepsilon\big) < \delta$ for all n > N.

If T is an unbiased estimator of θ, then by Chebyshev's inequality we have
$$P\big(|T - \theta| \ge \varepsilon\big) \le \frac{\operatorname{Var}(T)}{\varepsilon^2}.$$
This suggests that, among unbiased estimators, one with smaller variance will tend to be more concentrated about the parameter and is therefore preferable. In some cases one unbiased estimator may have smaller variance for some values of θ and larger variance for others. In certain cases it is possible to show that a particular unbiased estimator has the smallest possible variance among all possible unbiased estimators for all values of θ.

Definition 2.2.4. An estimator T* of θ is called a uniformly minimum variance unbiased estimator (UMVUE) of θ if

(i) T* is unbiased, and

(ii) Var(T*) ≤ Var(T) for any other unbiased estimator T of θ, for all θ in Θ, where Θ is the parameter space for θ.

There are two questions one may ask:

(1) How do we know that an estimator is UMVUE?

(2) How do we find the UMVUE?

The next theorem gives a partial answer to the first question: it provides a lower bound for the variance of an unbiased estimator for certain probability functions. There is no comparably general answer to the second question.

Let $X_1, \dots, X_n$ be a random sample of size n from $f(x;\theta)$ and let T be an unbiased estimator of θ. Then, under smoothness assumptions on $f(x;\theta)$,
$$\operatorname{Var}(T) \ge \frac{1}{n\,E\!\left[\left(\dfrac{\partial}{\partial\theta}\ln f(X;\theta)\right)^{2}\right]} = \text{CRLB},$$
where the expectation is an integral for continuous distributions and a summation with respect to x for discrete distributions.

Note: when $f(x;\theta)$ is twice differentiable in θ, the CRLB can equivalently be computed as $1\big/\big(-n\,E\big[\partial^2\ln f(X;\theta)/\partial\theta^2\big]\big)$.

Example 2.2.5

a) Consider a random sample from an exponential distribution with mean θ, $f(x;\theta) = \frac{1}{\theta}e^{-x/\theta}$, x > 0. Find the CRLB.

Here $\ln f(x;\theta) = -\ln\theta - x/\theta$ and $\frac{\partial}{\partial\theta}\ln f = -\frac{1}{\theta} + \frac{x}{\theta^2}$, so
$$E\!\left[\left(\frac{\partial}{\partial\theta}\ln f(X;\theta)\right)^{2}\right] = \frac{\operatorname{Var}(X)}{\theta^4} = \frac{1}{\theta^2},$$
and therefore CRLB $= \dfrac{\theta^2}{n}$. The MLE for θ is $\bar X$, with $E(\bar X) = \theta$ and $\operatorname{Var}(\bar X) = \theta^2/n$, so the MLE of θ is unbiased and its variance is equal to the CRLB; therefore $\bar X$ is UMVUE for θ.

b) Let us consider a random sample from a Bernoulli distribution; find the CRLB for p.

Here $f(x;p) = p^x(1-p)^{1-x}$ and $\frac{\partial}{\partial p}\ln f = \frac{x}{p} - \frac{1-x}{1-p} = \frac{x-p}{p(1-p)}$, so
$$E\!\left[\left(\frac{\partial}{\partial p}\ln f(X;p)\right)^{2}\right] = \frac{\operatorname{Var}(X)}{p^2(1-p)^2} = \frac{1}{p(1-p)},$$
and the CRLB for p is $\dfrac{p(1-p)}{n}$. The usual estimator for p is the sample proportion $\hat p = \bar X = \frac{1}{n}\times$(the number of successes). Since $E(\hat p) = p$, the sample proportion is an unbiased estimator of p, and its variance is $\operatorname{Var}(\hat p) = \dfrac{p(1-p)}{n}$, which equals the CRLB; therefore the sample proportion is UMVUE for p.

c) Consider a random sample from a Normal distribution with mean μ and variance σ². Find the CRLB for μ and for σ².

The MLEs for μ and σ² are $\bar X$ and $\frac{1}{n}\sum(X_i - \bar X)^2$, respectively. Note that $\bar X$ is unbiased for μ and that $S^2 = \frac{1}{n-1}\sum(X_i - \bar X)^2$ is the unbiased estimator for σ².

For μ: $\frac{\partial}{\partial\mu}\ln f = \frac{x-\mu}{\sigma^2}$, so $E\big[(\partial\ln f/\partial\mu)^2\big] = \frac{1}{\sigma^2}$ and the CRLB for μ is $\dfrac{\sigma^2}{n}$, which equals $\operatorname{Var}(\bar X)$; this means that $\bar X$ is UMVUE for μ.

For σ²: $\frac{\partial}{\partial\sigma^2}\ln f = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4}$, so $E\big[(\partial\ln f/\partial\sigma^2)^2\big] = \frac{1}{2\sigma^4}$ and the CRLB for σ² is $\dfrac{2\sigma^4}{n}$. Since $\operatorname{Var}(S^2) = \dfrac{2\sigma^4}{n-1} > \dfrac{2\sigma^4}{n}$ (see the properties of S² discussed below), the unbiased estimator S² does not attain the CRLB; the CRLB criterion alone therefore cannot certify S² as UMVUE (in fact S² can be shown to be UMVUE for σ² by other arguments, e.g., the Lehmann–Scheffé theorem).

d) Consider sampling from the uniform distribution on the interval (0, θ). Find the CRLB for θ.

Recall that if $T = Y_n = \max(X_1,\dots,X_n)$ and $W = \frac{n+1}{n}Y_n$, then W is unbiased for θ with $\operatorname{Var}(W) = \frac{\theta^2}{n(n+2)}$. Is W UMVUE for θ? Unfortunately we cannot find the CRLB here because the regularity (smoothness) conditions fail: the support (0, θ) depends on θ.

Remark In some text books UMVUE is simply UMVE (unbiased minimum variance
estimator) or MVUE (minimum variance unbiased estimator) without uniformly.
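As a numerical illustration of part (a) above, the sketch below (with arbitrary θ, n, and replication count) compares the simulated variance of the MLE $\bar X$ for an exponential mean with the CRLB $\theta^2/n$; the two should agree closely.

```python
import numpy as np

# Exponential sample with mean theta: the CRLB for unbiased estimators of theta is
# theta^2 / n, and the MLE Xbar attains it.
rng = np.random.default_rng(3)
theta, n, reps = 5.0, 20, 200_000

xbar = rng.exponential(scale=theta, size=(reps, n)).mean(axis=1)
print("empirical var(Xbar):", xbar.var())
print("CRLB = theta^2 / n :", theta**2 / n)
```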

Mean Square error


Let ˆ be the estimator of the unknown parameter  from the random sample X 1 ,..., X n . Then

clearly the deviation from ˆ to the true value of  , ˆ   , measures the quality of the

estimator, or equivalently, we can use (ˆ   ) 2 for the ease of computation. Since ˆ is a
random variable, we should take average to evaluation the quality of the estimator. Thus, we
introduce the following
Definition2.2.5: The mean square error (MSE) of an estimator ˆ of a parameter  is the
function of  defined by E (ˆ   ) 2 , and this is denoted as MSE  .

This is also called the risk function of an estimator, with (ˆ   ) 2 called the quadratic loss
function. The expectation is with respect to the random variables X 1 ,..., X n since they are the
only random components in the expression.
Notice that the MSE measures the average squared difference between the estimator ˆ and
the parameter  , a somewhat reasonable measure of performance for an estimator. In general,

any increasing function of the absolute distance ˆ   would serve to measure the goodness

 
of an estimator (mean absolute error, E ˆ   , is a reasonable alternative. But MSE has at

least two advantages over other distance measures: First, it is analytically tractable and,
secondly, it has the interpretation

MSE  E ˆ   2

 var(ˆ)  E (ˆ)   
2
 var(ˆ)  ( Bias of ˆ) 2
This is so because


E ˆ   
2
 E (ˆ 2 )  E (9 2 )  2E (ˆ)

 2

 var(ˆ)  E (ˆ)   2  2E (ˆ)
 var(ˆ)  [ E (ˆ)   ]2

Definition 2.2.6: The bias of an estimator $\hat\theta$ of a parameter θ is the difference between the expected value of $\hat\theta$ and θ; that is, $\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta$. An estimator whose bias is identically equal to 0 is called an unbiased estimator and satisfies $E(\hat\theta) = \theta$ for all θ.

Thus, MSE has two components: one measures the variability of the estimator (precision) and the other measures the bias (accuracy). An estimator that has good MSE properties has small combined variance and bias. To find an estimator with good MSE properties, we need to find estimators that control both variance and bias.

For an unbiased estimator $\hat\theta$, we have $\mathrm{MSE}_{\hat\theta} = E(\hat\theta - \theta)^2 = \operatorname{Var}(\hat\theta)$, and so, if an estimator is unbiased, its MSE is equal to its variance.


Example 2.2.5: Suppose $X_1, \dots, X_n$ are i.i.d. random variables with density function
$$f(x \mid \theta) = \frac{1}{2\theta}\exp\!\Big(-\frac{|x|}{\theta}\Big).$$
Show that the maximum likelihood estimator of θ,
$$\hat\theta = \frac{\sum_{i=1}^n |X_i|}{n},$$
is unbiased.

Solution: Let us first calculate $E|X|$ and $E(X^2)$:
$$E|X| = \int_{-\infty}^{\infty} |x|\,f(x\mid\theta)\,dx = \int_{-\infty}^{\infty}\frac{|x|}{2\theta}\exp\!\Big(-\frac{|x|}{\theta}\Big)dx = \int_0^{\infty}\frac{x}{\theta}\exp\!\Big(-\frac{x}{\theta}\Big)dx = \theta\int_0^{\infty} y e^{-y}\,dy = \Gamma(2)\,\theta = \theta,$$
$$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x\mid\theta)\,dx = \int_{-\infty}^{\infty}\frac{x^2}{2\theta}\exp\!\Big(-\frac{|x|}{\theta}\Big)dx = \int_0^{\infty}\frac{x^2}{\theta}\exp\!\Big(-\frac{x}{\theta}\Big)dx = \theta^2\int_0^{\infty} y^2 e^{-y}\,dy = \Gamma(3)\,\theta^2 = 2\theta^2.$$
Therefore,
$$E(\hat\theta) = E\Big(\frac{|X_1| + \dots + |X_n|}{n}\Big) = \frac{E|X_1| + \dots + E|X_n|}{n} = \theta,$$
so $\hat\theta$ is an unbiased estimator of θ. Thus the MSE of $\hat\theta$ is equal to its variance, i.e.
$$\mathrm{MSE} = E(\hat\theta - \theta)^2 = \operatorname{Var}(\hat\theta) = \operatorname{Var}\Big(\frac{|X_1| + \dots + |X_n|}{n}\Big) = \frac{\operatorname{Var}|X_1| + \dots + \operatorname{Var}|X_n|}{n^2} = \frac{\operatorname{Var}|X|}{n} = \frac{E(X^2) - (E|X|)^2}{n} = \frac{2\theta^2 - \theta^2}{n} = \frac{\theta^2}{n}.$$
The statistic S²: Recall that if $X_1, \dots, X_n$ come from a normal distribution with variance σ², then the sample variance S² is defined as
$$S^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}.$$
It can be shown that $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$. From the properties of the χ² distribution, we have
$$E\Big[\frac{(n-1)S^2}{\sigma^2}\Big] = n-1, \quad\text{so } E(S^2) = \sigma^2,$$
and
$$\operatorname{Var}\Big[\frac{(n-1)S^2}{\sigma^2}\Big] = 2(n-1), \quad\text{so } \operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1}.$$
Mean-square error: measures closeness of an estimator $\hat\theta$ to its estimand θ. Consistency assumes that the sample size n is very large, actually infinite; as a consequence, many estimators which are "bad" for any finite n can be labelled "good" according to the consistency criterion. An alternative measure of closeness is the mean-square error (MSE), defined as
$$\mathrm{MSE}(\hat\theta) = E\big[(\hat\theta - \theta)^2\big].$$
This measures the average (squared) distance between $\hat\theta(X_1, X_2, \dots, X_n)$ and θ as the data $X_1, X_2, \dots, X_n$ vary according to F. So if $\hat\theta$ and $\tilde\theta$ are two estimators of θ, we say that $\hat\theta$ is better than $\tilde\theta$ (in the mean-square error sense) if $\mathrm{MSE}(\hat\theta) \le \mathrm{MSE}(\tilde\theta)$.
Next are some properties of the MSE. The first relates MSE to the variance and bias of an
estimator.
Proposition 2.2.1. $\mathrm{MSE}(\hat\theta) = V(\hat\theta) + b_\theta(\hat\theta)^2$, where $b_\theta(\hat\theta) = E_\theta(\hat\theta) - \theta$ is the bias. Consequently, if $\hat\theta$ is an unbiased estimator of θ, then $\mathrm{MSE}(\hat\theta) = V(\hat\theta)$.

Proof. Let $m = E_\theta(\hat\theta)$. Then
$$\mathrm{MSE}(\hat\theta) = E\big[(\hat\theta - \theta)^2\big] = E\big[\big((\hat\theta - m) + (m - \theta)\big)^2\big].$$
Expanding the quadratic inside the expectation gives
$$\mathrm{MSE}(\hat\theta) = E\big[(\hat\theta - m)^2\big] + 2(m - \theta)\,E(\hat\theta - m) + (m - \theta)^2.$$
The first term is the variance of $\hat\theta$; the second term is zero by definition of m; and the third term is the squared bias.

Often the goal is to find estimators with small MSEs. From Proposition 2.2.1, this can be achieved by picking $\hat\theta$ to have small variance and small squared bias. But it turns out that, in general, making the bias small increases the variance, and vice versa. This is what is called the bias–variance trade-off. In some cases, if minimizing MSE is the goal, it can be better to allow a little bit of bias if it means a drastic decrease in the variance. In fact, many common estimators are biased, at least partly because of this trade-off.
Example 2.2.6. Let $X_1, X_2, \dots, X_n$ be iid N(μ, σ²) and suppose the goal is to estimate σ². Define the statistic $T = \sum_{i=1}^n (X_i - \bar X)^2$. Consider the class of estimators $\hat\sigma^2 = aT$ where a is a positive number. Reasonable choices of a include $a = (n-1)^{-1}$ and $a = n^{-1}$. Let's find the value of a that minimizes the MSE.

First observe that $\frac{1}{\sigma^2}T$ is a chi-square random variable with n − 1 degrees of freedom, so that $E_{\sigma^2}(T) = (n-1)\sigma^2$ and $V_{\sigma^2}(T) = 2(n-1)\sigma^4$. Write R(a) for $\mathrm{MSE}_{\sigma^2}(aT)$. Using Proposition 2.2.1 we get
$$R(a) = E_{\sigma^2}\big[(aT - \sigma^2)^2\big] = V_{\sigma^2}(aT) + b_{\sigma^2}(aT)^2 = 2a^2(n-1)\sigma^4 + \big[a(n-1)\sigma^2 - \sigma^2\big]^2 = \sigma^4\big[2a^2(n-1) + \big(a(n-1) - 1\big)^2\big].$$
To minimize R(a), set the derivative equal to zero and solve for a. That is,
$$0 = R'(a) = \sigma^4\big[4(n-1)a + 2(n-1)^2a - 2(n-1)\big].$$
From here it is easy to see that $a = (n+1)^{-1}$ is the only solution (and this must be a minimum since R(a) is a quadratic with positive leading coefficient). Therefore, among estimators of the form $\hat\sigma^2 = a\sum_{i=1}^n (X_i - \bar X)^2$, the one with smallest MSE is $\hat\sigma^2 = \frac{1}{n+1}\sum_{i=1}^n (X_i - \bar X)^2$. Note that this estimator is not unbiased, since $a \ne (n-1)^{-1}$. To put this another way, the classical estimator S² pays a price (larger MSE) for being unbiased.
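The conclusion $a = (n+1)^{-1}$ can be checked numerically. The sketch below (with arbitrary μ, σ², n, and replication count) evaluates the exact expression $R(a) = \sigma^4\big[2a^2(n-1) + (a(n-1)-1)^2\big]$ alongside a simulated MSE for the three natural choices of a.

```python
import numpy as np

# MSE of sigma2_hat = a * T with T = sum((X_i - Xbar)^2) for N(mu, sigma^2) data.
# Exact formula: R(a) = sigma^4 * (2*a^2*(n-1) + (a*(n-1) - 1)^2), minimized at a = 1/(n+1).
rng = np.random.default_rng(4)
mu, sigma2, n, reps = 1.0, 4.0, 10, 200_000

def exact_mse(a):
    return sigma2**2 * (2 * a**2 * (n - 1) + (a * (n - 1) - 1) ** 2)

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
T = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # T = sum (X_i - Xbar)^2

for label, a in [("1/(n-1)", 1 / (n - 1)), ("1/n", 1 / n), ("1/(n+1)", 1 / (n + 1))]:
    mse_sim = np.mean((a * T - sigma2) ** 2)
    print(f"a = {label:7s}: exact MSE = {exact_mse(a):.4f}, simulated MSE = {mse_sim:.4f}")
```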
Proposition 2.2.3. If $\mathrm{MSE}(\hat\theta_n) \to 0$ as $n \to \infty$, then $\hat\theta_n$ is a consistent estimator of θ.

Proof. Fix ε > 0 and note that $P\big(|\hat\theta_n - \theta| > \varepsilon\big) = P\big((\hat\theta_n - \theta)^2 > \varepsilon^2\big)$. Applying Markov's inequality to the latter term gives the upper bound $\varepsilon^{-2}\,\mathrm{MSE}(\hat\theta_n)$. Since this goes to zero by assumption, $\hat\theta_n$ is consistent.


2.3 Efficiency

As we have already remarked, for any given parameter estimation problem there are many different possible choices of estimator. One desirable quality for an estimator is that it be unbiased. However, this requirement alone does not impose a substantial condition, since (as we have seen) there can exist several different unbiased estimators for a given parameter. For example, given a random sample $X_1, X_2, \dots, X_n$ from the uniform distribution on [0, θ], both
$$\hat\theta_1 = \frac{n+1}{n}\max(x_1, x_2, \dots, x_n) \quad\text{and}\quad \hat\theta_2 = \frac{2}{n}(x_1 + x_2 + \dots + x_n)$$
are unbiased estimators of θ.

Likewise, it is also not hard to see that, given a random sample $X_1, X_2$ from the normal distribution with mean θ and standard deviation σ, the estimators $\hat\theta_1 = \tfrac12(x_1 + x_2)$ and $\hat\theta_2 = \tfrac13(x_1 + 2x_2)$ are also both unbiased. More generally, any estimator of the form $a x_1 + (1-a)x_2$ will be an unbiased estimator of the mean. We would now like to know if there is a meaningful way to say one of these unbiased estimators is better than the other.

In the abstract, it seems reasonable to say that an estimator with a smaller variance is better
than one with a larger variance, since a smaller variance would indicate that the value of the
estimator stays closer to the “true” parameter value more often. We formalize this as follows:

Definition 2.3.1: If $\hat\theta_1$ and $\hat\theta_2$ are two unbiased estimators for the parameter θ, we say that $\hat\theta_1$ is more efficient than $\hat\theta_2$ if $\operatorname{Var}(\hat\theta_1) < \operatorname{Var}(\hat\theta_2)$.

For example, the sample mean $\bar x$, with $\operatorname{Var}(\bar x) = \sigma^2/n$, can be compared with a competing estimator such as the sample median $x_m$ by forming the ratio
$$e = \frac{\operatorname{Var}(\hat\theta_1)}{\operatorname{Var}(\hat\theta_2)}.$$
Then:

(i) if $e < 1$, $\hat\theta_1$ is more efficient than $\hat\theta_2$;

(ii) if $e > 1$, $\hat\theta_2$ is more efficient than $\hat\theta_1$;

(iii) if $e = 1$, $\hat\theta_1$ and $\hat\theta_2$ are equally efficient.

Example 2.2.1: Let $x_1, x_2, x_3, x_4$ be 4 observations taken from a N(μ, σ²) population. Find the efficiency of $T = \tfrac17(x_1 + 3x_2 + 2x_3 + x_4)$ relative to $\bar x = \tfrac14\sum_{i=1}^4 x_i$.

Solution: We find the expected value of T:
$$E(T) = \tfrac17\big[E(x_1) + 3E(x_2) + 2E(x_3) + E(x_4)\big] = \tfrac17\big[\mu + 3\mu + 2\mu + \mu\big] = \tfrac17\cdot 7\mu = \mu.$$
Therefore, T is an unbiased estimator of μ. We next find the expected value of $\bar x$:
$$E(\bar x) = \tfrac14\big[E(x_1) + E(x_2) + E(x_3) + E(x_4)\big] = \tfrac14\big[\mu + \mu + \mu + \mu\big] = \tfrac14\cdot 4\mu = \mu.$$
Now we find the variances of T and $\bar x$:
$$\operatorname{Var}(T) = \tfrac{1}{49}\big[\operatorname{Var}(x_1) + 9\operatorname{Var}(x_2) + 4\operatorname{Var}(x_3) + \operatorname{Var}(x_4)\big] = \tfrac{1}{49}\big[\sigma^2 + 9\sigma^2 + 4\sigma^2 + \sigma^2\big] = \tfrac{15}{49}\sigma^2,$$
$$\operatorname{Var}(\bar x) = \frac{\sigma^2}{4}.$$
This implies that
$$e = \frac{\operatorname{Var}(\bar x)}{\operatorname{Var}(T)} = \frac{\sigma^2/4}{15\sigma^2/49} = \frac{49}{60} < 1.$$
Therefore, $\bar x$ is more efficient than T.
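A simulation sketch (the values of μ and σ below are arbitrary illustrative choices) confirming the variances and the ratio 49/60 computed above:

```python
import numpy as np

# Relative efficiency of T = (x1 + 3*x2 + 2*x3 + x4)/7 versus the mean of 4 observations.
# Theory: var(T) = 15*sigma^2/49, var(xbar) = sigma^2/4, so var(xbar)/var(T) = 49/60.
rng = np.random.default_rng(5)
mu, sigma, reps = 10.0, 2.0, 300_000

x = rng.normal(mu, sigma, size=(reps, 4))
T = (x[:, 0] + 3 * x[:, 1] + 2 * x[:, 2] + x[:, 3]) / 7
xbar = x.mean(axis=1)

print("var(T)   :", T.var(),    " theory:", 15 * sigma**2 / 49)
print("var(xbar):", xbar.var(), " theory:", sigma**2 / 4)
print("ratio    :", xbar.var() / T.var(), " theory:", 49 / 60)
```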

Example 2.3.2: Suppose that a random sample x, y is taken from the normal distribution with mean θ and standard deviation σ.

1. Find the variance of the estimator $\hat\theta_1 = \tfrac12(x + y)$.

2. Find the variance of the estimator $\hat\theta_2 = \tfrac13(x + 2y)$.

3. Show that $\hat\theta_1$ and $\hat\theta_2$ are both unbiased estimators of θ.

4. Which of $\hat\theta_1$, $\hat\theta_2$ is a more efficient estimator of θ?

5. More generally, for $\hat\theta_a = ax + (1-a)y$, which value of a produces the most efficient estimator?

Solution

To compute the variances and check for unbiasedness, we use properties of expected value and the additivity of variance for independent variables.

1. Because x and y are independent, their variances are additive, and $\operatorname{Var}(x) = \operatorname{Var}(y) = \sigma^2$. Then we have
$$\operatorname{Var}(\hat\theta_1) = \operatorname{Var}\big(\tfrac12 x + \tfrac12 y\big) = \tfrac14\operatorname{Var}(x) + \tfrac14\operatorname{Var}(y) = \tfrac14\sigma^2 + \tfrac14\sigma^2 = \tfrac12\sigma^2.$$

2. Similarly,
$$\operatorname{Var}(\hat\theta_2) = \operatorname{Var}\big(\tfrac13 x + \tfrac23 y\big) = \tfrac19\operatorname{Var}(x) + \tfrac49\operatorname{Var}(y) = \tfrac59\sigma^2.$$

3. We have $E(\hat\theta_1) = E\big(\tfrac12(x + y)\big) = \tfrac12 E(x) + \tfrac12 E(y) = \theta$. Likewise, $E(\hat\theta_2) = E\big(\tfrac13(x + 2y)\big) = \tfrac13 E(x) + \tfrac23 E(y) = \theta$. Thus, both estimators are unbiased.

4. Since $\operatorname{Var}(\hat\theta_1) = \tfrac12\sigma^2$ while $\operatorname{Var}(\hat\theta_2) = \tfrac59\sigma^2$, we see that $\hat\theta_1$ is more efficient, since $\tfrac12 < \tfrac59$.
Example 2.3.3: Suppose that a random sample x, y is taken from the normal distribution with mean θ and standard deviation σ.

5. More generally, for $\hat\theta_a = ax + (1-a)y$, which value of a produces the most efficient estimator?

In the same way as before, we can compute
$$\operatorname{Var}(\hat\theta_a) = \operatorname{Var}\big(ax + (1-a)y\big) = a^2\operatorname{Var}(x) + (1-a)^2\operatorname{Var}(y) = \big[a^2 + (1-a)^2\big]\sigma^2 = (2a^2 - 2a + 1)\sigma^2.$$
By calculus (the derivative is 4a − 2, which is zero for $a = \tfrac12$) or by completing the square ($2a^2 - 2a + 1 = 2(a - \tfrac12)^2 + \tfrac12$), we can see that the minimum of the quadratic occurs when $a = \tfrac12$. Thus, in fact, $\hat\theta_1$ is the most efficient estimator of this form. (Answer: $a = \tfrac12$.)

Intuitively, this last calculation should make sense, because if we put more weight on one observation, its variation will tend to dominate the calculation. In the extreme situation of taking a = 1 (so that the estimator is just the single observation x), for example, we see that the variance is simply σ², which is much larger than the variance arising from the average. This is quite reasonable, since the average $\tfrac12(x + y)$ uses a bigger sample and thus captures more information than just using a single observation.
Example 2.3.4: A random sample $X_1, X_2, \dots, X_n$ is taken from the uniform distribution on [0, θ].

1. Find the variance of the unbiased estimator $\hat\theta_1 = \dfrac{n+1}{n}\max(x_1, \dots, x_n)$.

2. Find the variance of the unbiased estimator $\hat\theta_2 = \dfrac{2}{n}(x_1 + x_2 + \dots + x_n)$.

3. Which estimator is a more efficient estimator for θ?

Solution

1. To compute the variance of $\hat\theta_1$, we use the pdf of $\max(x_1, \dots, x_n)$, which is $g(x) = \dfrac{n x^{n-1}}{\theta^n}$ for $0 \le x \le \theta$. Then
$$E\big[\max(x_1, \dots, x_n)^2\big] = \int_0^\theta x^2\cdot\frac{n x^{n-1}}{\theta^n}\,dx = \frac{n}{n+2}\,\theta^2.$$
Also,
$$E\big[\max(x_1, \dots, x_n)\big] = \int_0^\theta x\cdot\frac{n x^{n-1}}{\theta^n}\,dx = \frac{n}{n+1}\,\theta,$$
so
$$\operatorname{Var}\big[\max(x_1, \dots, x_n)\big] = \frac{n}{n+2}\theta^2 - \Big(\frac{n}{n+1}\Big)^2\theta^2 = \frac{n}{(n+2)(n+1)^2}\,\theta^2.$$
Therefore,
$$\operatorname{Var}(\hat\theta_1) = \Big(\frac{n+1}{n}\Big)^2\operatorname{Var}\big[\max(x_1, \dots, x_n)\big] = \frac{\theta^2}{n(n+2)}.$$

2. For $\hat\theta_2$, since the $x_i$ are independent, their variances are additive. Hence
$$\operatorname{Var}(x_i) = \int_0^\theta x^2\cdot\frac{1}{\theta}\,dx - \Big(\frac{\theta}{2}\Big)^2 = \frac{\theta^2}{3} - \frac{\theta^2}{4} = \frac{\theta^2}{12}.$$
Thus,
$$\operatorname{Var}(\hat\theta_2) = \frac{4}{n^2}\big[\operatorname{Var}(x_1) + \dots + \operatorname{Var}(x_n)\big] = \frac{4}{n^2}\cdot n\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n}.$$

3. We just calculated $\operatorname{Var}(\hat\theta_1) = \dfrac{\theta^2}{n(n+2)}$ and $\operatorname{Var}(\hat\theta_2) = \dfrac{\theta^2}{3n}$.
For n = 1 these variances are the same (this is unsurprising because when n = 1 the estimators themselves are the same!).
For n > 1 we see that the variance of $\hat\theta_1$ is smaller, since $\frac{1}{n+2} < \frac{1}{3}$, so $\hat\theta_1$ is more efficient.
Cramer-Rao Lower Bound
The variance of any estimator is always bounded below (since it is by definition nonnegative). So it is quite reasonable to ask whether, for a fixed estimation problem, there might be an optimal unbiased estimator: namely, one of minimal variance.
This question turns out to be quite subtle, because we are not guaranteed that such an estimator necessarily exists. For example, it could be the case that the possible variances of unbiased estimators form an open interval of the form (a, ∞) for some a ≥ 0.
Then there would be estimators whose variances approach the value a arbitrarily closely, but there is none that actually achieves the lower bound value a.
There is a lower bound on the possible values for the variance of an unbiased estimator:
Theorem (Cramer-Rao Inequality):
Suppose that $p_X(x;\theta)$ is a probability density function that is differentiable in θ. Also suppose that the support of $p_X$, the set of values of x where $p_X(x;\theta) > 0$, does not depend on the parameter θ. If $X_1, \dots, X_n$ is a random sample drawn from X, $\hat\theta = f(X_1, \dots, X_n)$ is an unbiased estimator of θ, and $\ell = \ln p_X(x;\theta)$ denotes the log-pdf of the distribution, then
$$\operatorname{Var}(\hat\theta) \ge \frac{1}{I(\theta)}, \quad\text{where}\quad I(\theta) = n\,E\!\left[\Big(\frac{\partial\ell}{\partial\theta}\Big)^{2}\right].$$
In the event that $p_X$ is twice differentiable in θ, it can be shown that I(θ) can also be calculated as
$$I(\theta) = -n\,E\!\left[\frac{\partial^2\ell}{\partial\theta^2}\right].$$
A few remarks:
A few remarks:
 The proof of the Cramer-Rao inequality is rather technical (although not conceptually difficult),
so we will omit the precise details.
 In practice, it is not always so easy to evaluate the lower bound in the Cramer-Rao inequality.
 Furthermore, there does not always exist an unbiased estimator that actually achieves the Cramer-
Rao bound.
 However, if we are able to find an unbiased estimator whose variance does achieve the Cramer-Rao bound, then the inequality guarantees that this estimator is the most efficient possible.
Example 2.3.5: Suppose that a coin with unknown probability θ of landing heads is flipped n times, yielding results $X_1, \dots, X_n$ (where we interpret heads as 1 and tails as 0). Let $\hat\theta = \frac{1}{n}(x_1 + \dots + x_n)$.

1. Show that $\hat\theta$ is an unbiased estimator of θ.
2. Find the variance of $\hat\theta$.
3. Show that $\hat\theta$ has the minimum variance of all possible unbiased estimators of θ.

Solution
We first compute the expected value and variance of this estimator; then we evaluate the lower bound in the Cramer-Rao inequality. The claim is that the given estimator actually achieves this lower bound.

1. Since $X_1 + X_2 + \dots + X_n$ is binomially distributed with parameters n and θ, its expected value is nθ. Then $E(\hat\theta) = \frac{1}{n}\,n\theta = \theta$, so $\hat\theta$ is unbiased.

2. Since $X_1 + X_2 + \dots + X_n$ is binomially distributed with parameters n and θ, its variance is nθ(1 − θ). Then the variance of $\hat\theta = \frac{1}{n}(x_1 + \dots + x_n)$ is
$$\operatorname{Var}(\hat\theta) = \frac{1}{n^2}\cdot n\theta(1-\theta) = \frac{\theta(1-\theta)}{n}.$$

3. We compute the Cramer-Rao bound: if $\ell = \ln p_X(x;\theta)$ is the log-pdf of the distribution, then $\operatorname{Var}(\hat\theta) \ge \dfrac{1}{I(\theta)}$ where $I(\theta) = -n\,E\big[\partial^2\ell/\partial\theta^2\big]$. Here, the likelihood of a single observation can be written as $p_X(x;\theta) = \theta^x(1-\theta)^{1-x}$ (it is θ if x = 1 and 1 − θ if x = 0), so that $\ell = x\ln\theta + (1-x)\ln(1-\theta)$.
Differentiating twice yields
$$\frac{\partial^2\ell}{\partial\theta^2} = -\frac{X}{\theta^2} - \frac{1-X}{(1-\theta)^2}.$$
So, since E(X) = θ, the expected value is
$$E\Big[\frac{\partial^2\ell}{\partial\theta^2}\Big] = -\frac{E(X)}{\theta^2} - \frac{E(1-X)}{(1-\theta)^2} = -\frac{1}{\theta} - \frac{1}{1-\theta} = -\frac{1}{\theta(1-\theta)}.$$
From the above, the Cramer-Rao bound is
$$\operatorname{Var}(\hat\theta) \ge \frac{\theta(1-\theta)}{n}.$$
But we calculated before that for our unbiased estimator $\hat\theta = \frac{1}{n}(x_1 + \dots + x_n)$ we do in fact have $\operatorname{Var}(\hat\theta) = \dfrac{\theta(1-\theta)}{n}$.
Therefore, our estimator $\hat\theta$ achieves the Cramer-Rao bound, so it has the minimum variance of all possible unbiased estimators of θ, as claimed.
So, in fact, the obvious estimator is actually the best possible!
Example 2.3.6: Show that the maximum-likelihood estimator $\hat\mu = \frac{1}{n}(x_1 + \dots + x_n)$ is the most efficient possible unbiased estimator of the mean of a normal distribution with unknown mean μ and known standard deviation σ.

Solution
We will show that this estimator achieves the Cramer-Rao bound.
For this, we first compute the log-pdf
$$\ell = -\tfrac12\ln(2\pi) - \ln\sigma - \frac{(X-\mu)^2}{2\sigma^2}.$$
Differentiating yields
$$\frac{\partial\ell}{\partial\mu} = \frac{X-\mu}{\sigma^2} \quad\text{and then}\quad \frac{\partial^2\ell}{\partial\mu^2} = -\frac{1}{\sigma^2}.$$
The Cramer-Rao bound therefore dictates that $\operatorname{Var}(\hat\mu) \ge \dfrac{\sigma^2}{n}$ for any unbiased estimator $\hat\mu$.
For our estimator, since the $X_i$ are all independent and normally distributed with mean μ and standard deviation σ, we have
$$\operatorname{Var}(\hat\mu) = \frac{1}{n^2}\big[\operatorname{Var}(X_1) + \dots + \operatorname{Var}(X_n)\big] = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}.$$
Thus, the variance of our estimator $\hat\mu$ achieves the Cramer-Rao bound, meaning that it is the most efficient unbiased estimator possible.

In the comparison of various statistical procedures, efficiency is a measure of the quality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance.

Definition 2.3.2: The efficiency of an unbiased estimator T of θ is defined to be the ratio of the Cramer-Rao Lower Bound (CRLB) to Var(T); it is denoted by e(T) and is given by
$$e(T) = \frac{\text{CRLB}}{\operatorname{Var}(T)},$$
provided the probability function is smooth. Note that an unbiased estimator whose variance attains the CRLB has efficiency one.

Example 2.3.1: Consider a random sample from a Normal distribution with mean μ and variance σ². Note that $\bar X$ is unbiased for μ and that S² is the unbiased estimator for σ². Since $\operatorname{Var}(\bar X) = \sigma^2/n$ equals the CRLB for μ, $e(\bar X) = 1$; and since $\operatorname{Var}(S^2) = 2\sigma^4/(n-1)$ while the CRLB for σ² is $2\sigma^4/n$, $e(S^2) = (n-1)/n < 1$.
Definition 2.3.3 (relative efficiency): Let T1 and T2 be two unbiased estimators of θ with variances Var(T1) and Var(T2) respectively.

(i) We call T1 more efficient than T2 if Var(T1) < Var(T2).

(ii) The relative efficiency of T1 with respect to T2, denoted RE(T1, T2), is defined as the ratio Var(T2)/Var(T1).

(iii) The relative efficiency of T2 with respect to T1, denoted RE(T2, T1), is defined as the ratio Var(T1)/Var(T2).

Example 2.2.2: Let $X_1, X_2, X_3, \dots, X_n$ be a random sample of size n from the uniform distribution on the interval (0, θ). Also let $T_1 = (n+1)Y_1$ and $T_2 = \frac{n+1}{n}Y_n$, where $Y_1$ and $Y_n$ are the smallest (first) and largest (nth) order statistics, respectively. Using the density functions of the 1st order statistic and the nth order statistic we can easily show that E(T1) = θ, E(T2) = θ, and
$$\operatorname{Var}(T_1) = \frac{n\theta^2}{n+2}, \qquad \operatorname{Var}(T_2) = \frac{\theta^2}{n(n+2)},$$
so that $RE(T_2, T_1) = \operatorname{Var}(T_1)/\operatorname{Var}(T_2) = n^2$; thus T2 is n² times more efficient than T1.

Summary

We discussed biased and unbiased estimators. We discussed efficiency of estimators. We


stated the Cramer-Rao bound and used it to show that some of our unbiased estimators were
the most efficient possible ones. Next lecture: Interval estimation and confidence intervals.

2.4 SUFFICIENCY

Introduction
In the lesson on Point Estimation, we derived estimators of various parameters using two methods, namely, the method of maximum likelihood and the method of moments. The estimators resulting from these two methods are typically intuitive estimators. It makes sense, for example, that we would want to use the sample mean $\bar X$ and sample variance S² to estimate the mean μ and variance σ² of a normal population. In this lesson, we'll learn how to find statistics that summarize all of the information in a sample about the desired parameter. Such statistics are called sufficient statistics, and hence the name of this lesson.

Definition 2.4.1: Let $X_1, \dots, X_n$ be a random sample from a probability distribution with unknown parameter θ. Then the statistic $Y = u(X_1, \dots, X_n)$ is said to be sufficient for θ if the conditional distribution of $X_1, \dots, X_n$, given the statistic Y, does not depend on the parameter θ; i.e., $f(x_1, \dots, x_n \mid Y = y)$ is free of θ.

Example 2.4.1

Let $X_1, \dots, X_n$ be a random sample of Bernoulli trials in which:

 $X_i = 1$ if the i-th subject likes Pepsi, with probability p;

 $X_i = 0$ if the i-th subject does not like Pepsi, with probability q = 1 − p,

for $i = 1, 2, 3, \dots, n$. Suppose, in a random sample of n = 40 people, that $Y = \sum_{i=1}^n X_i$ people like Pepsi. If we know the value of Y, the number of successes in n trials, can we gain any further information about the parameter p by considering other functions of the data $X_1, \dots, X_n$? That is, is Y sufficient for p?

Solution

The definition of sufficiency tells us that if the conditional distribution of $X_1, \dots, X_n$, given the statistic Y, does not depend on p, then Y is a sufficient statistic for p. The conditional distribution of $X_1, \dots, X_n$, given Y, is by definition:
$$P(X_1 = x_1, \dots, X_n = x_n \mid Y = y) = \frac{P(X_1 = x_1, \dots, X_n = x_n,\, Y = y)}{P(Y = y)} \qquad (**)$$

Now, for the sake of concreteness, suppose we were to observe a random sample of size n = 3 in which $x_1 = 1$, $x_2 = 0$, and $x_3 = 1$. In this case, $P(X_1 = 1, X_2 = 0, X_3 = 1,\, Y = 1) = 0$, because the sum of the data values, $\sum x_i$, is 1 + 0 + 1 = 2, but Y, which is defined to be the sum of the $X_i$'s, is 1. That is, because 2 ≠ 1, the event in the numerator of the starred (**) equation is an impossible event and therefore its probability is 0.

Now, let's consider an event that is possible, namely $\{X_1 = 1, X_2 = 0, X_3 = 1,\, Y = 2\}$. In that case, we have, by independence:
$$P(X_1 = 1, X_2 = 0, X_3 = 1,\, Y = 2) = p(1-p)p = p^2(1-p).$$

So, in general, $P(X_1 = x_1, \dots, X_n = x_n,\, Y = y) = 0$ if $\sum_{i=1}^n x_i \ne y$, and
$$P(X_1 = x_1, \dots, X_n = x_n,\, Y = y) = p^y(1-p)^{n-y} \quad\text{if}\quad \sum_{i=1}^n x_i = y.$$

Now, the denominator in the starred (**) equation above is the binomial probability of getting exactly y successes in n trials with probability of success p. That is, the denominator is
$$P(Y = y) = \binom{n}{y}p^y(1-p)^{n-y}, \qquad y = 0, 1, 2, \dots, n.$$
Putting the numerator and denominator together, we get, if $\sum_{i=1}^n x_i = y$, that the conditional probability is
$$P(X_1 = x_1, \dots, X_n = x_n \mid Y = y) = \frac{p^y(1-p)^{n-y}}{\binom{n}{y}p^y(1-p)^{n-y}} = \frac{1}{\binom{n}{y}},$$
and $P(X_1 = x_1, \dots, X_n = x_n \mid Y = y) = 0$ if $\sum_{i=1}^n x_i \ne y$.

We have just shown that the conditional distribution of $X_1, \dots, X_n$ given Y does not depend on p. Therefore, Y is indeed sufficient for p. That is, once the value of Y is known, no other function of the $X_i$'s will provide any additional information about the possible value of p.
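The conclusion can also be seen empirically. The sketch below (n = 3, y = 2, and the two values of p are arbitrary choices) estimates the conditional distribution of $(X_1, X_2, X_3)$ given Y = 2 for two very different values of p; in both cases each admissible arrangement has probability about 1/3.

```python
import numpy as np
from collections import Counter

# Sufficiency illustration: given Y = sum(X_i), the conditional distribution of the
# individual Bernoulli outcomes does not depend on p (here n = 3, conditioning on Y = 2).
rng = np.random.default_rng(7)
n, y, reps = 3, 2, 400_000

for p in (0.3, 0.7):
    x = rng.binomial(1, p, size=(reps, n))
    kept = x[x.sum(axis=1) == y]                        # keep only samples with Y = y
    counts = Counter(map(tuple, kept.tolist()))
    dist = {pattern: round(c / len(kept), 3) for pattern, c in sorted(counts.items())}
    print(f"p = {p}: conditional distribution of (X1, X2, X3) given Y = {y}: {dist}")
```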
Factorization Theorem 2.4.1
While the definition of sufficiency may make sense intuitively, it is not always easy to find the conditional distribution of $X_1, \dots, X_n$ given Y — not to mention that we would have to find that conditional distribution for every statistic we'd want to consider as a possible sufficient statistic! Therefore, using the formal definition of sufficiency as a way of identifying a sufficient statistic for a parameter θ can often be a daunting road to follow. A theorem often referred to as the Factorization Theorem provides an easier method. We state it here without proof.

Factorization Theorem: Let $X_1, \dots, X_n$ denote random variables with joint probability density function or joint probability mass function $f(x_1, x_2, \dots, x_n;\theta)$, which depends on the parameter θ. Then the statistic $Y = u(X_1, \dots, X_n)$ is sufficient for θ if and only if the p.d.f. (or p.m.f.) can be factored into two components, that is,
$$f(x_1, \dots, x_n;\theta) = \phi\big[u(x_1, \dots, x_n);\theta\big]\,h(x_1, \dots, x_n),$$
where:

 φ is a function that depends on the data only through the function $u(x_1, \dots, x_n)$, and
 the function $h(x_1, \dots, x_n)$ does not depend on the parameter θ.

Let's put the theorem to work on a few examples!

Example 2.4.2

Let $X_1, \dots, X_n$ denote a random sample from a Poisson distribution with parameter λ > 0. Find a sufficient statistic for the parameter λ.

Solution

Because $X_1, \dots, X_n$ is a random sample, the joint probability mass function of $X_1, \dots, X_n$ is, by independence:
$$f(x_1, \dots, x_n;\lambda) = f(x_1;\lambda)\,f(x_2;\lambda)\cdots f(x_n;\lambda).$$
Inserting what we know to be the probability mass function of a Poisson random variable with parameter λ, the joint p.m.f. is therefore:
$$f(x_1, \dots, x_n;\lambda) = \frac{e^{-\lambda}\lambda^{x_1}}{x_1!}\cdot\frac{e^{-\lambda}\lambda^{x_2}}{x_2!}\cdots\frac{e^{-\lambda}\lambda^{x_n}}{x_n!}.$$
Now, simplifying, by adding up all n of the λ's in the exponents, as well as all of the $x_i$'s in the exponents, we get:
$$f(x_1, \dots, x_n;\lambda) = \big(e^{-n\lambda}\lambda^{\sum_i x_i}\big)\times\Big(\frac{1}{x_1!\,x_2!\cdots x_n!}\Big).$$
We have just factored the joint p.m.f. into two functions, one (φ) being only a function of the statistic $Y = \sum_{i=1}^n X_i$ and the other (h) not depending on the parameter λ.

Therefore, the Factorization Theorem tells us that $Y = \sum_{i=1}^n X_i$ is a sufficient statistic for λ. We can also write the joint p.m.f. as:
$$f(x_1, \dots, x_n;\lambda) = \big(e^{-n\lambda}\lambda^{n\bar x}\big)\times\Big(\frac{1}{x_1!\,x_2!\cdots x_n!}\Big).$$
Therefore, the Factorization Theorem tells us that $Y = \bar X$ is also a sufficient statistic for λ!

If you think about it, it makes sense that $Y = \bar X$ and $Y = \sum_{i=1}^n X_i$ are both sufficient statistics, because if we know $\bar X$, we can easily find $\sum_{i=1}^n X_i$, and if we know $\sum_{i=1}^n X_i$, we can easily find $\bar X$.


The previous example suggests that there can be more than one sufficient statistic for a
parameter  . In general, if Y is a sufficient statistic for a parameter  , then every one-to-one
function of Y not involving  is also a sufficient statistic for  . Let's take a look at another
example.

Example 2.4.3 Let X 1 , ... , X n be a random sample from a normal distribution with mean 

and variance 1. Find a sufficient statistic for the parameter  .

Solution

Because X 1 , ... , X n is a random sample, the joint probability density function of X 1 , ... , X n is,
by independence:

$$f(x_1, \dots, x_n;\mu) = f(x_1;\mu)\,f(x_2;\mu)\cdots f(x_n;\mu).$$

Inserting what we know to be the probability density function of a normal random variable with mean μ and variance 1, the joint p.d.f. is:
$$f(x_1, x_2, \dots, x_n;\mu) = \frac{1}{(2\pi)^{1/2}}\exp\!\Big[-\frac{(x_1-\mu)^2}{2}\Big]\times\frac{1}{(2\pi)^{1/2}}\exp\!\Big[-\frac{(x_2-\mu)^2}{2}\Big]\times\cdots\times\frac{1}{(2\pi)^{1/2}}\exp\!\Big[-\frac{(x_n-\mu)^2}{2}\Big].$$
Collecting like terms, we get:
$$f(x_1, x_2, \dots, x_n;\mu) = \frac{1}{(2\pi)^{n/2}}\exp\!\Big[-\frac12\sum_{i=1}^n (x_i-\mu)^2\Big].$$

A trick to making the factoring of the joint p.d.f. an easier task is to add 0 to the quantity in parentheses in the summation; that is, to write $x_i - \mu = (x_i - \bar x) + (\bar x - \mu)$. Squaring the quantity in parentheses, we get:
$$f(x_1, x_2, \dots, x_n;\mu) = \frac{1}{(2\pi)^{n/2}}\exp\!\Big\{-\frac12\sum_{i=1}^n\big[(x_i-\bar x)^2 + 2(x_i-\bar x)(\bar x-\mu) + (\bar x-\mu)^2\big]\Big\}.$$
And then, distributing the summation, we get:
$$f(x_1, x_2, \dots, x_n;\mu) = \frac{1}{(2\pi)^{n/2}}\exp\!\Big\{-\frac12\Big[\sum_{i=1}^n(x_i-\bar x)^2 + 2(\bar x-\mu)\sum_{i=1}^n(x_i-\bar x) + \sum_{i=1}^n(\bar x-\mu)^2\Big]\Big\}.$$
But the middle term in the exponent is 0, because $\sum_{i=1}^n (x_i - \bar x) = 0$, and the last term, because it doesn't depend on the index i, can be added up n times. So, simplifying, we get:
$$f(x_1, x_2, \dots, x_n;\mu) = \exp\!\Big[-\frac{n}{2}(\bar x-\mu)^2\Big]\times\Big\{\frac{1}{(2\pi)^{n/2}}\exp\!\Big[-\frac12\sum_{i=1}^n(x_i-\bar x)^2\Big]\Big\}.$$

In summary, we have factored the joint p.d.f. into two functions, one (φ) being only a function of the statistic $\bar X$ and the other (h) not depending on the parameter μ.

Therefore, the Factorization Theorem tells us that $Y = \bar X$ is a sufficient statistic for μ. Now, $Y = \bar X^3$ is also sufficient for μ, because if we are given the value of $\bar X^3$, we can easily get the value of $\bar X$ through the one-to-one function $w = y^{1/3}$; that is,
$$W = (\bar X^3)^{1/3} = \bar X.$$

On the other hand, $Y = \bar X^2$ is not a sufficient statistic for μ, because it is not a one-to-one function. That is, if we are given the value of $\bar X^2$, using the inverse function
$$w = y^{1/2}$$
we get two possible values, namely $+\bar X$ and $-\bar X$.

We're getting so good at this, let's take a look at one more example!

Example 2.4.4

Let X 1 , ... , X n be a random sample from an exponential distribution with parameter  . Find a

sufficient statistic for the parameter  .

Solution

Because $X_1, \dots, X_n$ is a random sample, the joint probability density function of $X_1, \dots, X_n$ is, by independence:
$$f(x_1, \dots, x_n;\theta) = f(x_1;\theta)\,f(x_2;\theta)\cdots f(x_n;\theta).$$
Inserting what we know to be the probability density function of an exponential random variable with parameter θ, the joint p.d.f. is:
$$f(x_1, x_2, \dots, x_n;\theta) = \frac{1}{\theta}\exp\!\Big(-\frac{x_1}{\theta}\Big)\times\frac{1}{\theta}\exp\!\Big(-\frac{x_2}{\theta}\Big)\times\cdots\times\frac{1}{\theta}\exp\!\Big(-\frac{x_n}{\theta}\Big).$$
Now, simplifying, by collecting the n factors of $\tfrac{1}{\theta}$ and adding up the $x_i$'s in the exponents, we get:
$$f(x_1, x_2, \dots, x_n;\theta) = \frac{1}{\theta^n}\exp\!\Big(-\frac{1}{\theta}\sum_{i=1}^n x_i\Big).$$
We have again factored the joint p.d.f. into two functions, one (φ) being only a function of the statistic $Y = \sum_{i=1}^n X_i$ and the other (h), here simply $h(x_1, \dots, x_n) = 1$, not depending on the parameter θ.

Therefore, the Factorization Theorem tells us that $Y = \sum_{i=1}^n X_i$ is a sufficient statistic for θ. And, since $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ is a one-to-one function of $Y = \sum_{i=1}^n X_i$, it follows that $Y = \bar X$ is also a sufficient statistic for θ.

Exponential Form

You might not have noticed that, in all of the examples we have considered so far in this lesson, every p.d.f. or p.m.f. could be written in what is often called exponential form, that is:
$$f(x;\theta) = \exp\big[K(x)p(\theta) + S(x) + q(\theta)\big],$$
with

1. K(x) and S(x) being functions only of x,

2. p(θ) and q(θ) being functions only of the parameter θ,

3. the support being free of the parameter θ.

First, we had Bernoulli random variables, with p.m.f. written in exponential form as:
$$f(x;p) = \exp\!\Big[x\ln\frac{p}{1-p} + \ln(1-p)\Big],$$
with:

1. K(x) = x and S(x) = 0 being functions only of x,

2. $p(p) = \ln\frac{p}{1-p}$ and $q(p) = \ln(1-p)$ being functions only of the parameter p,

3. the support x = 0, 1 not depending on the parameter p.

To get this form you can use the laws of logarithms. Take, for example, a Bernoulli distribution whose p.m.f. is given by:
$$f(x;p) = p^x(1-p)^{1-x}.$$
Taking logs on both sides, you get:
$$\ln f(x;p) = \ln\big[p^x(1-p)^{1-x}\big] = x\ln p + (1-x)\ln(1-p) = x\ln\frac{p}{1-p} + \ln(1-p),$$
so that
$$f(x;p) = \exp\!\Big[x\ln\frac{p}{1-p} + \ln(1-p)\Big].$$

Example 2.4.5

Write the Poisson random variable in exponential form, whose p.m.f. is given as:
$$f(x;\lambda) = \frac{e^{-\lambda}\lambda^x}{x!} = \exp\big[x\ln\lambda - \ln(x!) - \lambda\big],$$
with

1. K(x) = x and S(x) = −ln(x!) being functions only of x,

2. p(λ) = ln λ and q(λ) = −λ being functions only of the parameter λ,

3. the support x = 0, 1, … being free of the parameter λ.

We can likewise write the normal p.d.f. (with known variance) in exponential form, with:

1. K(x) and S(x) being functions only of x,

2. p(μ) and q(μ) being functions only of the parameter μ,

3. the support −∞ < x < ∞ being free of the parameter μ.

Example 2.4.6

Then, we have exponential random variables, whose p.d.f. can be written in exponential form as:
$$f(x;\theta) = \frac{1}{\theta}\exp\!\Big(-\frac{x}{\theta}\Big) = \exp\!\Big[x\Big(-\frac{1}{\theta}\Big) - \ln\theta\Big],$$
with

1. K(x) = x and S(x) = 0 being functions only of x,

2. p(θ) = −1/θ and q(θ) = −ln θ being functions only of the parameter θ,

3. the support x > 0 being free of the parameter θ.

Writing p.d.f.s and p.m.f.s in exponential form provides you a third way of identifying
sufficient statistics for our parameters.

Theorem 2.4.2 (Exponential Criterion):

Let $X_1, X_2, \dots, X_n$ be a random sample from a distribution with a p.d.f. or p.m.f. of the exponential form
$$f(x;\theta) = \exp\big[K(x)p(\theta) + S(x) + q(\theta)\big],$$
with a support that does not depend on θ. Then the statistic $\sum_{i=1}^n K(X_i)$ is sufficient for θ.

Exercise
Let X 1 , X 2 ,..., X n be a random sample from a geometric distribution with parameter p. Find a

sufficient statistic for the parameter p .


Remarks
(i) Generally sufficient statistics are involved in the construction of maximum likelihood
estimators.

(ii) The order statistics form a set of joint sufficient statistics for the parameter(s) of any distribution.

(iii) A set of statistics (T1, T2, …, Tk) is called minimal sufficient if its members are jointly sufficient for the parameter and are a function of every other set of jointly sufficient statistics.
Confidence intervals
Confidence intervals (CIs) provide a method of attaching a quantification of uncertainty to an estimator $\hat\theta$ when we wish to estimate an unknown parameter θ. We want to find an interval (A, B) that we think has high probability of containing θ.

Definition: Suppose that $X_n = (X_1, \dots, X_n)$ is a random sample from a distribution $P_\theta$, $\theta \in \Theta \subseteq \mathbb{R}^k$ (a distribution that depends on a parameter θ). Suppose that we want to estimate g(θ), a real-valued function of θ. Let A ≤ B be two statistics that have the property that, for all values of θ,
$$P_\theta\big(A \le g(\theta) \le B\big) \ge 1 - \alpha, \qquad \alpha \in (0, 1).$$
Then the random interval (A, B) is called a confidence interval for g(θ) with level (coefficient) 1 − α. If the inequality '≥ 1 − α' is an equality for all θ, the CI is called exact.
Example 1: Find a level (1 − α) CI for μ from data which are i.i.d. N(μ, σ²), where σ is known. Here θ = μ and g(θ) = μ.

Solution
We want to construct $\psi(X_1, \dots, X_n;\mu)$ such that the distribution of this object is known to us. How do we proceed here?
The usual way is to find some decent estimator of μ and combine it along with μ in some way to get a "pivot", i.e., a random variable whose distribution does not depend on μ. The most intuitive estimator of μ here is the sample mean $\bar X_n$. We know that
$$\bar X_n \sim N\Big(\mu, \frac{\sigma^2}{n}\Big).$$
The standardized version of the sample mean follows N(0, 1) and can therefore act as a pivot. In other words, construct
$$\psi(X_n, \mu) = \frac{\bar X - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar X - \mu)}{\sigma} \sim N(0, 1)$$
for every value of μ. With $z_\alpha$ denoting the upper α-th quantile of N(0, 1) (i.e., $P(Z > z_\alpha) = \alpha$ where Z follows N(0, 1)), we can write:
$$P\left(\left|\frac{\sqrt{n}(\bar X - \mu)}{\sigma}\right| \le z_{\alpha/2}\right) = 1 - \alpha.$$
From the above display we can find limits for μ such that the above inequalities are simultaneously satisfied. On doing the algebra, we get:
$$P\left(\bar X - \frac{\sigma}{\sqrt{n}}z_{\alpha/2} \le \mu \le \bar X + \frac{\sigma}{\sqrt{n}}z_{\alpha/2}\right) = 1 - \alpha.$$
Thus our level (1 − α) CI for μ is given by
$$\left(\bar X - \frac{\sigma}{\sqrt{n}}z_{\alpha/2},\; \bar X + \frac{\sigma}{\sqrt{n}}z_{\alpha/2}\right).$$
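A minimal sketch of this interval in Python (simulated data; the true mean, σ, n, and α below are arbitrary illustrative choices), using scipy's normal quantile function:

```python
import numpy as np
from scipy.stats import norm

# Level (1 - alpha) CI for mu with sigma known:
# (Xbar - z_{alpha/2} * sigma/sqrt(n), Xbar + z_{alpha/2} * sigma/sqrt(n)).
rng = np.random.default_rng(8)
mu_true, sigma, n, alpha = 5.0, 2.0, 25, 0.05     # illustrative values

x = rng.normal(mu_true, sigma, size=n)
xbar = x.mean()
z = norm.ppf(1 - alpha / 2)                       # upper alpha/2 quantile of N(0, 1)
half_width = z * sigma / np.sqrt(n)
print(f"{100 * (1 - alpha):.0f}% CI for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```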
Often a standard method of constructing CIs is the following method of pivots, which we describe below.

(1) Construct a function ψ using the data $X_n$ and g(θ), say $\psi(X_n, g(\theta))$, such that the distribution of this random variable under parameter value θ does not depend on θ and is known. Such a ψ is called a pivot.

(2) Let G denote the distribution function of the pivot. The idea now is to get a range of plausible values of the pivot; the level of confidence 1 − α is used to get the appropriate range. This can be done in a variety of ways, but the following is standard. Denote by q(G, β) the β-th quantile of G. Thus,
$$P_\theta\big[\psi(X_n, g(\theta)) \le q(G, \beta)\big] = \beta.$$

(3) Choose $\alpha_1, \alpha_2 \ge 0$ such that $\alpha_1 + \alpha_2 = \alpha$. Then,
$$P_\theta\big[q(G, \alpha_1) \le \psi(X_n, g(\theta)) \le q(G, 1 - \alpha_2)\big] = 1 - \alpha_2 - \alpha_1 = 1 - \alpha.$$

(4) Vary θ across its domain and choose your level 1 − α confidence interval (set) as the set of all g(θ) such that the two inequalities in the above display are simultaneously satisfied.
Example 2: The data are the same as in Example 1, but now σ² is no longer known. Thus the unknown parameter is θ = (μ, σ²) and we are interested in finding a CI for g(θ) = μ.

Solution
Clearly, setting
$$\psi(X_n, \mu) = \frac{\sqrt{n}(\bar X - \mu)}{\sigma}$$
will not work smoothly here. This certainly has a known N(0, 1) distribution, but it involves the nuisance parameter σ, making it difficult to get a CI for μ directly.

However, one can replace σ by s, where s² is the natural estimate of σ² introduced before. So, set:
$$\psi(X_n, \mu) = \frac{\sqrt{n}(\bar X - \mu)}{s}.$$
This only depends on the data and g(θ) = μ. We claim that this is indeed a pivot. To see this, write
$$\frac{\sqrt{n}(\bar X - \mu)}{s} = \frac{\sqrt{n}(\bar X - \mu)/\sigma}{\sqrt{s^2/\sigma^2}}.$$
The numerator on the extreme right of the above display follows N(0, 1), and the denominator is independent of the numerator and is the square root of a $\chi^2_{n-1}$ random variable divided by its degrees of freedom. It follows from the definition that $\psi(X_n, \mu) \sim t_{n-1}$.

Thus, G here is the $t_{n-1}$ distribution and we can choose the quantiles to be $q(t_{n-1}, \alpha/2)$ and $q(t_{n-1}, 1 - \alpha/2)$. By symmetry of the $t_{n-1}$ distribution about 0, we have
$$q(t_{n-1}, \alpha/2) = -q(t_{n-1}, 1 - \alpha/2).$$
It follows that
$$P_{\mu,\sigma^2}\left(-q(t_{n-1}, 1-\tfrac{\alpha}{2}) \le \frac{\sqrt{n}(\bar X - \mu)}{s} \le q(t_{n-1}, 1-\tfrac{\alpha}{2})\right) = 1 - \alpha.$$
As with Example 1, direct algebraic manipulation shows that this is the same as the statement
$$P_{\mu,\sigma^2}\left(\bar X - \frac{s}{\sqrt{n}}q(t_{n-1}, 1-\tfrac{\alpha}{2}) \le \mu \le \bar X + \frac{s}{\sqrt{n}}q(t_{n-1}, 1-\tfrac{\alpha}{2})\right) = 1 - \alpha.$$
This gives a level 1 − α confidence set for μ.
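The corresponding computation with the t quantile, again as a minimal sketch with arbitrary illustrative values:

```python
import numpy as np
from scipy.stats import t

# Level (1 - alpha) CI for mu with sigma unknown:
# (Xbar - t_{n-1, alpha/2} * s/sqrt(n), Xbar + t_{n-1, alpha/2} * s/sqrt(n)).
rng = np.random.default_rng(9)
mu_true, sigma, n, alpha = 5.0, 2.0, 15, 0.05     # illustrative values

x = rng.normal(mu_true, sigma, size=n)
xbar, s = x.mean(), x.std(ddof=1)
tq = t.ppf(1 - alpha / 2, df=n - 1)               # upper alpha/2 quantile of t_{n-1}
half_width = tq * s / np.sqrt(n)
print(f"{100 * (1 - alpha):.0f}% CI for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```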


Remark: In each of the above examples there are innumerable ways of decomposing α as $\alpha_1 + \alpha_2$. It turns out that when α is split equally, the level 1 − α CIs obtained in Examples 1 and 2 are the shortest.

What are desirable properties of confidence sets? On one hand, we require high levels of confidence; in other words, we would like α to be as small as possible. On the other hand, we would like our CIs to be as short as possible. Unfortunately, we cannot simultaneously make the confidence levels of our CIs go up and the lengths of our CIs go down.
In Example 1, the length of the level (1 − α) CI is
$$2\,z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}.$$
As we reduce α (for higher confidence), $z_{\alpha/2}$ increases, making the CI wider.
However, we can reduce the length of our CI for a fixed α by increasing the sample size. If my sample size is 4 times yours, I will end up with a CI which has the same level as yours but has half the length of your CI. Can we hope to get absolute confidence, i.e., α = 0? That is too much to ask: when α = 0, $z_{\alpha/2} = \infty$ and the CIs for μ are infinitely large. The same can be verified for Example 2.


Asymptotic pivots using the central limit theorem: The CLT allows us to construct an approximate pivot for large sample sizes for estimating the population mean μ for any underlying distribution F.
Let $X_1, \dots, X_n$ be i.i.d. observations from some common distribution F and let
$$E(X_1) = \mu \quad\text{and}\quad \operatorname{Var}(X_1) = \sigma^2.$$
We are interested in constructing an approximate level (1 − α) CI for μ.

By the CLT, $\bar X$ is approximately N(μ, σ²/n) for large n; in other words,
$$\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \text{ is approximately } N(0, 1).$$
If σ is known, the above quantity is an approximate pivot and, following Example 1, we can therefore write:
$$P_\mu\left(-z_{\alpha/2} \le \frac{\sqrt{n}(\bar X - \mu)}{\sigma} \le z_{\alpha/2}\right) \approx 1 - \alpha.$$
As before, this translates to
$$P_\mu\left(\bar X - \frac{\sigma}{\sqrt{n}}z_{\alpha/2} \le \mu \le \bar X + \frac{\sigma}{\sqrt{n}}z_{\alpha/2}\right) \approx 1 - \alpha.$$
This gives an approximate level (1 − α) CI for μ when σ is known.
The approximation will improve as the sample size n increases. Note that the true coverage of the above CI may be different from (1 − α) and can depend heavily on the nature of F and the sample size n.
Realistically, however, σ is unknown and is replaced by s. Since we are dealing with large sample sizes, s is with very high probability close to σ, and the interval
$$\left(\bar X - \frac{s}{\sqrt{n}}z_{\alpha/2},\; \bar X + \frac{s}{\sqrt{n}}z_{\alpha/2}\right)$$
still remains an approximate level (1 − α) CI.
Interpretation of confidence intervals: Let (A, B) be a coefficient-γ confidence interval for a parameter θ. Let (a, b) be the observed value of the interval. It is NOT correct to say that θ lies in the interval (a, b) with probability γ. It is true that θ will lie in the random interval having endpoints $A(X_1, \dots, X_n)$ and $B(X_1, \dots, X_n)$ with probability γ.

After observing the specific values $A(X_1, \dots, X_n) = a$ and $B(X_1, \dots, X_n) = b$, it is not possible to assign a probability to the event that θ lies in the specific interval (a, b) without regarding θ as a random variable. We usually say that there is confidence γ that θ lies in the interval (a, b).
