Applied Statistics Lecture Notes

Kosuke Imai

Department of Politics
Princeton University

February 2, 2008

Making statistical inferences means learning about what you do not observe, called parameters, from what you do observe, called data. We learn the basic principles of statistical inference from the perspective of causal inference, which is a common goal of political science research. Namely, we study statistics by learning how to make causal inferences with statistical methods.

1 Statistical Framework of Causal Inference


What exactly do we mean when we say “an event A causes another event B”? Whether explicitly
or implicitly, this question is asked and answered all the time in political science research. The
most commonly used statistical framework of causality is based on the notion of counterfactuals
(see Holland, 1986). That is, we ask, “What would have happened if event A had been
absent (or present)?” The following example illustrates the fact that some causal questions are
more difficult to answer than others.

Example 1 (Counterfactual and Causality) Interpret each of the following statements as a causal statement.

1. A politician voted for the education bill because she is a Democrat.

2. A politician voted for the education bill because she is liberal.

3. A politician voted for the education bill because she is a woman.

In this framework, therefore, the fundamental problem of causal inference is that the coun-
terfactual outcomes cannot be observed, and yet any causal inference requires both factual and
counterfactual outcomes. This idea is formalized below using the potential outcomes notation.

Definition 1 (Potential Outcomes) Let Ti be the causal (or treatment) variable of interest for
unit i where i = 1, 2, . . . , n. Ti is a random variable which takes a value in a set T . The potential
outcome Yi (t) represents the outcome that would be observed for unit i if it receives the treatment
whose value is t, i.e., Ti = t for t ∈ T .
We use Y_i to denote the observed outcome for unit i. The treatment variable determines which
of the potential outcomes is revealed. This can be seen, for example, from the fact that if the
treatment is binary, the observed outcome is given by Y_i = T_i Y_i(1) + (1 − T_i) Y_i(0).
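As a small illustration of this switching equation (a minimal sketch, not part of the original notes; the potential outcomes below are made-up numbers), the following Python snippet constructs both potential outcomes for a few hypothetical units and shows that the observed outcome reveals only one of them:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
Y1 = np.array([3.0, 1.5, 2.0, 4.0, 0.5])   # hypothetical potential outcomes under treatment
Y0 = np.array([2.0, 1.0, 2.5, 1.0, 0.0])   # hypothetical potential outcomes under control
T = rng.integers(0, 2, size=n)             # a binary treatment indicator

# The switching equation: Y_i = T_i * Y_i(1) + (1 - T_i) * Y_i(0)
Y_obs = T * Y1 + (1 - T) * Y0

print(T)      # which potential outcome is revealed for each unit
print(Y_obs)  # the observed outcomes; the other potential outcome stays hidden
```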

The framework described above makes an implicit but important assumption that the treatment
status of one unit does not affect the potential outcomes of another unit. This can be formalized
as follows,

Assumption 1 (No Interference Between Units) Formally, let T and T̃ be n-dimensional vectors of treatment assignments, whose ith elements represent the treatment value of unit i where i = 1, 2, . . . , n. Let Y_i(T) be the potential outcome of unit i given the treatment assignment for all units, i.e., T. Then, the assumption implies that Y_i(T) = Y_i(T̃) whenever T_i = T̃_i.

This assumption is sometimes called the Stable Unit Treatment Value Assumption (SUTVA). Now,
consider the following examples.

Example 2 Is the assumption of no interference between units violated in the following examples?

1. (Flu vaccine) Units: individuals, Treatment: flu shot, Potential outcomes: hospitalization
with and without the shot.

2. (Negative campaign in elections) Units: candidates, Treatment: Use of negative ads, Potential
outcomes: Vote shares with and without the negative ads.

Of course, the potential outcomes framework described above is “a” model of causal inference,
which has proven to be effective in many applied settings. The philosophical discussion of what
causality is makes an interesting topic, but it is beyond the scope of this course.

Now, any causal quantity of interest for each unit can be written as a function of these
potential outcomes. For notational simplicity, we consider the situation where the treatment
variable T_i is binary (i.e., T = {0, 1}), which implies that there are two potential outcomes for
each unit, i.e., Y_i(1) and Y_i(0). However, the arguments presented here can be extended directly
to causal inference with a multi-valued treatment variable. We first give the definition of some
frequently used causal effects for each unit.

Definition 2 (Unit Causal Effects) Let Ti be a binary treatment variable. The following causal
effects can be defined for each unit.

1. Difference: Yi (1) − Yi (0)

2. Ratio: Yi (1)/Yi (0)

3. Percentage Increase: 100 · [Yi (1) − Yi (0)]/Yi (0)

Because the potential outcomes, Yi (1) and Yi (0), are never jointly observed, the joint distri-
bution of the potential outcomes, P (Yi (1), Yi (0)), cannot be directly inferred from the data. This
implies that the distribution of unit causal effects, e.g., P (Yi (1) − Yi (0)), also cannot be estimated
directly from the data without additional assumptions. However, as we see later, the average causal
effects can be identified in some situations and are often quantities of interest (e.g., Imbens, 2004).
We first consider the sample average differences of potential outcomes.

Definition 3 (Sample Average Causal Effects) Let Ti be a binary (random) treatment vari-
able for unit i where i = 1, 2, . . . , n. Consider fixed (i.e., non-random) but possibly unknown
potential outcomes, Yi (1) and Yi (0), for each i. Then, the following sample average causal effects
of interest can be defined.

1. (Sample Average Causal Effect) \frac{1}{n} \sum_{i=1}^{n} [Y_i(1) - Y_i(0)].

2. (Sample Average Causal Effect for the Treated) \frac{1}{\sum_{i=1}^{n} T_i} \sum_{i=1}^{n} T_i [Y_i(1) - Y_i(0)].
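The following sketch (not from the notes; fully simulated data) computes these two sample averages in a setting where, unlike in practice, both potential outcomes are generated and therefore known:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1000
Y0 = rng.normal(0.0, 1.0, size=n)          # simulated potential outcomes under control
Y1 = Y0 + rng.normal(1.0, 0.5, size=n)     # simulated potential outcomes under treatment
T = rng.binomial(1, 0.5, size=n)           # simple random assignment

sate = np.mean(Y1 - Y0)                    # sample average causal effect
satt = np.sum(T * (Y1 - Y0)) / np.sum(T)   # sample average causal effect for the treated

print(sate, satt)
```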

The distinction between the sample and population causal quantities is important.

Definition 4 (Population Average Causal Effects) Let Ti be a binary (random) treatment variable for unit i where i = 1, 2, . . . , n. Let (Yi (0), Yi (1)) be a simple random sample of potential
outcomes from a population. Then, the following population average causal effects of interest can
be defined.

1. (Population Average Causal Effect) E[Y (1) − Y (0)].

2. (Population Average Causal Effect for the Treated) E[Y (1) − Y (0) | T = 1].

The subscript i can be dropped because it is a simple random sample. We can also define the
conditional average causal effects given the observed characteristics of each unit in the sample.

Definition 5 (Conditional Average Causal Effects) Let Ti be a binary (random) treatment variable for unit i where i = 1, 2, . . . , n. Let (Yi (0), Yi (1), Xi ) be a simple random sample from a
population where Yi (0) and Yi (1) denote potential outcomes and Xi is a vector of characteristics
for each unit i. Then, the following conditional average causal effects of interest can be defined.

1. (Conditional Average Causal Effect) \frac{1}{n} \sum_{i=1}^{n} E[Y_i(1) - Y_i(0) \mid X_i].

2. (Conditional Average Causal Effect for the Treated) \frac{1}{\sum_{i=1}^{n} T_i} \sum_{i=1}^{n} T_i E[Y_i(1) - Y_i(0) \mid X_i].

Here, the subscript i is retained because the conditional expectation is taken with respect to a
particular unit i whose characteristics are represented by Xi .

2 Statistical Analysis of Classical Randomized Experiments
In this section, we first consider statistical analysis of classical randomized experiments as a way to
motivate the general theory of statistical inference.

2.1 Fisher’s Hypothesis Testing


Ronald A. Fisher was the first to come up with the idea that randomized experiments can be used
to test a scientific hypothesis. Before him, scientists used controlled experiments, in which they
tried to make the treatment and control groups as similar as possible (except that the former
receives the treatment and the latter does not). However, controlled experiments have two problems.
First, researchers can never make the conditions completely identical for each group. Second and
more importantly, when these differences are not eliminated, there is no way for researchers to
quantify the error in the results caused by those uncontrolled differences.
To overcome these problems, Fisher (1935) proposed the use of randomized experiments and
illustrated their use with the following famous example,

Example 3 (Lady Tasting Tea) One summer afternoon in 1919 in Cambridge, England, a
group of university dons, their wives, and some guests were sitting around an outdoor table for
afternoon tea. A lady declared, “Tea tastes different depending on whether the tea was poured into
the milk or whether the milk was poured into the tea.” How should one test this proposition using
a randomized experiment?
This simple example can be generalized into a method called randomization (or permutation)
inference. In the potential outcomes framework described earlier, the randomization of treatment
guarantees the independence between the treatment and potential outcomes.

Definition 6 (Randomization of Treatment) The treatment is said to be randomized if the treatment variable Ti is independent of all potential outcomes, Yi (t), for all units, i.e., Yi (t) ⊥ Ti for all t and all i.
Note that there are many ways of randomizing the treatment. For example, simple random as-
signment assigns the treatment to each unit independently with equal probability, while completely
random assignment (what Fisher did) randomly selects the predetermined number of units which
receive the treatment. Other designs include matched pair design, randomized blocks, and Latin
square.
We now formalize Fisher’s randomization inference as follows,

Definition 7 (Randomization Inference) Let Ti be a binary treatment variable for unit i where
i = 1, 2, . . . , n, and T be an n dimensional vector whose ith element is Ti . Then, P (T = t) defines
the randomized treatment assignment mechanism. Suppose that Y(t) represents an n dimensional
vector of fixed (but possibly unknown) potential outcomes when T = t, and Tobs represents the
observed treatment assignment. Then, randomization inference is defined as follows,

1. (Sharp Null Hypothesis) H0 : Yi (1) − Yi (0) = τ0 for all i with some fixed τ0 .

2. (Test Statistic) S(Y, T).

3. (p-value) P (S(Y, Tobs ) ≤ S(Y, T)) computed under the sharp null hypothesis.

A smaller value of the p-value indicates more (statistically) significant evidence against H0 .
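A minimal Python sketch of this procedure (not from the notes; simulated data, the sharp null of zero effect for every unit, and the difference in means as one possible choice of test statistic) is:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated completely randomized experiment: n treated, n control
n = 20
T = np.array([1] * n + [0] * n)
rng.shuffle(T)
Y = rng.normal(0.5 * T, 1.0)               # observed outcomes (true effect is 0.5 here)

def stat(y, t):
    """Difference-in-means test statistic."""
    return y[t == 1].mean() - y[t == 0].mean()

s_obs = stat(Y, T)

# Under the sharp null Y_i(1) = Y_i(0), the observed outcomes are fixed and
# only the treatment assignment is random: re-randomize and recompute the statistic.
n_perm = 10000
perm_stats = np.empty(n_perm)
for b in range(n_perm):
    T_perm = rng.permutation(T)
    perm_stats[b] = stat(Y, T_perm)

p_value = np.mean(perm_stats >= s_obs)     # one-sided randomization p-value
print(s_obs, p_value)
```

Because every assignment vector with n treated units is equally likely under complete randomization, the Monte Carlo proportion above approximates the exact randomization p-value of Definition 7.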

Randomization inference described here is inference about the sample rather than the population.
The only source of randomness, therefore, comes from the randomized assignment of the treat-
ment, and this creates the reference distribution of the test statistic under the null hypothesis.
Randomization inference is distribution-free because it does not make a distributional assumption
about potential outcomes. It is also exact because it does not make any approximation. Moreover,
randomization inference respects the randomization procedure that was actually conducted in the
experiment, as the following application of randomization inference shows,

Example 4 (California Alphabet Lottery) Since 1975, California law has mandated that the
Secretary of State draw a random alphabet for each election to determine the order of candidates
for the first assembly district. The law further requires the candidate order to be systematically
rotated throughout the remaining assembly districts. Ho and Imai (2006) apply randomization
inference to the California alphabet lottery to investigate ballot order effects in elections.
Alternatively, one can interpret randomization inference as inference about the infinite pop-
ulation from which a simple random sample is obtained. In this case, we assume that potential
outcomes, (Yi (1), Yi (0)), are sampled at random from a population, which is characterized by
marginal distributions P (Y (1)) and P (Y (0)), respectively. Now, the potential outcomes are ran-
dom variables. Then, the null hypothesis of no treatment effect is given by,

H0 : P (Y (1)) = P (Y (0)).

Under this null hypothesis, if each unit i gets assigned at random to the treatment or control group,
then the distribution of the observed outcome vector Y will still be the same for any treatment
assignment pattern. Therefore, each value of any test statistic, which corresponds to each treatment
assignment pattern, is equally likely. This argument suggests that the randomization inference as
described in Definition 7 can be viewed as inference about a population.
Finally, scientific significance cannot be judged from the p-value, which can only tell us about
statistical significance. To assess scientific significance, we need to go beyond hypothesis testing.

2.2 Neyman’s Analysis of Randomized Experiments


Randomization inference described above is concerned with unit causal effects as defined in Definition 2. The sharp null hypothesis says that the causal effect is zero for every unit, and the randomization confidence interval is derived under the assumption of a constant treatment effect. In contrast, Neyman (1923) considered inference about the sample average causal effect as defined in Definition 3. Neyman showed that the difference-in-means estimator is unbiased for the sample average causal effect and derived the expression for its variance.

Theorem 1 (Estimation of Sample Average Causal Effect) Consider a completely randomized experiment where 2n units are randomly divided into treatment and control groups of equal size. Let T_i be the binary treatment variable and Y_i be the observed outcome. Consider the following estimator of the sample average causal effect τ,

\hat{\tau} \equiv \frac{1}{n} \sum_{i=1}^{2n} [T_i Y_i - (1 - T_i) Y_i].

Then,

E(\hat{\tau}) = \tau, \qquad var(\hat{\tau}) = \frac{S_1^2}{2n} + \frac{S_0^2}{2n} + \frac{S_{01}}{n},

where S_1^2 and S_0^2 are the (sample) variances of the potential outcomes Y_i(1) and Y_i(0), respectively,
and S_{01} is their sample covariance.
Under randomization, the sample variances of Y_i(1) and Y_i(0) can be estimated without bias
using the sample variances of the observed outcomes in the treatment and control groups. However,
the sample covariance between the two potential outcomes cannot be estimated directly because
we never observe them jointly. Neyman (1923) further demonstrated that the standard estimator
of the variance of the estimated average treatment effect is conservative (i.e., too large).

Theorem 2 (Bounds for Variance of Sample Average Causal Effect Estimator) If \hat{\tau} represents the estimator of the average treatment effect defined in Theorem 1, then its variance satisfies the following inequality,

var(\hat{\tau}) \le \frac{S_1^2}{n} + \frac{S_0^2}{n},

where the upper bound is attained under the constant treatment effect assumption.
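As a numerical check of Theorems 1 and 2 (a sketch, not from the notes; the potential outcomes are simulated once and then held fixed across re-randomizations):

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed (but here fully known) potential outcomes for 2n units
n = 50
Y0 = rng.normal(0.0, 1.0, size=2 * n)
Y1 = Y0 + rng.normal(1.0, 1.0, size=2 * n)   # non-constant treatment effects
tau = np.mean(Y1 - Y0)                        # sample average causal effect

base = np.array([1] * n + [0] * n)

# Repeat the complete randomization many times, holding potential outcomes fixed
n_sims = 20000
tau_hat = np.empty(n_sims)
for b in range(n_sims):
    T = rng.permutation(base)
    tau_hat[b] = Y1[T == 1].mean() - Y0[T == 0].mean()

# Unbiasedness (Theorem 1) and the conservative upper bound (Theorem 2)
S1_sq = np.var(Y1, ddof=1)
S0_sq = np.var(Y0, ddof=1)
print(tau, tau_hat.mean())                    # should be close
print(tau_hat.var(), S1_sq / n + S0_sq / n)   # true variance <= upper bound
```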
So far we have focused on the estimation of sample causal quantities. Alternatively, we can
also consider the estimation of the population average causal effect as defined in Definition 4 by
thinking that the sample at hand comes from a population. It turns out that in this situation the
variance can be identified from the data.

Theorem 3 (Estimation of Population Average Causal Effect) Consider the same experiment and estimator, \hat{\tau}, as defined in Theorem 1, except that the potential outcomes (Y_i(1), Y_i(0)) are a simple random sample from a population with marginal means (\mu_1, \mu_0) and marginal variances (\sigma_1^2, \sigma_0^2). Consider the population average causal effect as the estimand, i.e., \tau = \mu_1 - \mu_0. Then,

E(\hat{\tau}) = \tau, \qquad var(\hat{\tau}) = \frac{\sigma_1^2}{n} + \frac{\sigma_0^2}{n}.

Therefore, we can estimate the variance of \hat{\tau} without bias directly from the data using the sample
variances of the observed outcomes in the treatment and control groups. Note that the variance of
\hat{\tau} as an estimator of the population average causal effect is greater than its variance as an estimator
of the sample average causal effect. This makes sense because the former involves the extra variability
induced by random sampling from a population.
All the properties derived above are finite sample properties because they hold regardless of one’s
sample size. We are also interested in asymptotic (large sample) properties of a given estimator.
“How does a particular estimator behave as the sample size goes to infinity?” A “good” estimator
should converge to the true value of its estimand. We would also want to derive the asymptotic
distribution of the estimator so that an approximate variance of the estimator can be obtained.

Theorem 4 (Asymptotic Properties of the Difference-in-Means Estimator) Consider the same setting as in Theorem 3, where we denote the difference-in-means estimator by \hat{\tau}_n. Then,

1. (Consistency) \hat{\tau}_n \overset{p}{\to} \tau.

2. (Asymptotic Normality) \sqrt{n}(\hat{\tau}_n - \tau) \overset{d}{\to} N(0, \sigma_1^2 + \sigma_0^2).

Given the intuitions we developed through two particular examples, we now turn to the general
discussion of point estimation, hypothesis testing, and interval estimation.

3 Point Estimation
Building on the intuition we developed from Neyman’s approach, we study general principles of the
estimation of scientific quantities of interest. The estimation of any quantity involves uncertainty,
which needs to be quantified in every statistical estimation. One common way to quantify the
uncertainty of one’s estimate is to estimate its variance.

3.1 Nonparametric Estimation


The difference-in-means estimator is a simple and special case of so-called nonparametric plug-in
estimators. The key idea of nonparametric estimation is to avoid as many assumptions as
possible. Neyman's estimator is such an example because it does not make any assumption about
the distribution of potential outcomes. A simple but important nonparametric estimator is the
empirical CDF, which is discrete and puts 1/n probability mass at each realization of X_i. The key
properties of the empirical CDF are given below,

Theorem 5 (Properties of Empirical CDF) Let X_i with i = 1, 2, . . . , n be a simple random sample from a population, which is characterized by the distribution function F, and let \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1\{X_i \le x\} be the empirical CDF. Then, for any fixed x,

1. E(\hat{F}_n(x)) = F(x).

2. var(\hat{F}_n(x)) = \frac{F(x)(1 - F(x))}{n}.

3. \hat{F}_n(x) \overset{p}{\to} F(x).
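A short sketch of the empirical CDF and its pointwise behavior (not from the notes; simulated standard normal data, with scipy used only to evaluate the true CDF) is:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

n = 500
X = rng.normal(size=n)

def ecdf(x_grid, data):
    """Empirical CDF: fraction of observations less than or equal to each grid point."""
    return np.array([np.mean(data <= x) for x in x_grid])

x_grid = np.array([-1.0, 0.0, 1.0])
F_hat = ecdf(x_grid, X)
F_true = norm.cdf(x_grid)

print(F_hat)                                   # unbiased for F(x), converges as n grows
print(F_true)
print(F_true * (1 - F_true) / n)               # pointwise variance F(x)(1 - F(x)) / n
```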

Using the empirical CDF, we can construct a class of nonparametric estimators, called nonparametric plug-in estimators,

Definition 8 (Nonparametric Plug-in Estimator) Let X_i with i = 1, 2, . . . , n be a simple random sample from a population, which is characterized by the distribution function F. If we define a statistical functional θ = S(F), then the nonparametric plug-in estimator of θ is given by,

\hat{\theta}_n = S(\hat{F}_n),

where \hat{F}_n(x) is the empirical distribution function.


In this section, we focus on a special type of statistical functionals.

Definition 9 (Linear Statistical Functional) Let F be an unknown distribution function, which characterizes the data generating process. Then, a statistical functional of the form,

\theta = \int g(x)\, dF(x) = \begin{cases} \int g(x) f(x)\, dx & \text{if } X \text{ is continuous}, \\ \sum_x g(x) f(x) & \text{if } X \text{ is discrete}, \end{cases}

is called a linear functional, where f is the probability density or mass function corresponding to F.
Some examples will help understand the above definitions.

Example 5 Write mean, variance, and correlation as linear statistical functionals and then derive
their nonparametric plug-in estimators. Do they equal sample counterparts?

For a linear functional, S(aF + bG) = aS(F) + bS(G) holds for any distribution functions F, G
and constants a, b (hence its name). The above definition also implies that the nonparametric plug-in
estimator for a linear functional is in general given by

S(\hat{F}_n) = \int g(x)\, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} g(x_i).
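For instance (a sketch, not from the notes; simulated exponential data), taking g(x) = x gives the mean as a linear functional, and its plug-in estimator is simply the sample mean; g(x) = x^2 gives the second moment:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.exponential(scale=2.0, size=1000)      # simulated data with true mean 2

def plug_in(g, data):
    """Nonparametric plug-in estimator of the linear functional integral of g(x) dF(x)."""
    return np.mean(g(data))

theta_hat = plug_in(lambda x: x, X)            # g(x) = x: the mean
second_moment = plug_in(lambda x: x ** 2, X)   # g(x) = x^2: the second moment

print(theta_hat, second_moment)
```

Functionals such as the variance are not linear themselves but can be written as combinations of linear functionals, which is the point of Example 5 above.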

Under certain regularity conditions, it can be shown that a nonparametric plug-in estimator of a
linear statistical functional converges in probability to the true statistical functional. Furthermore,
one can also show the asymptotic normality of this plug-in estimator.

Theorem 6 (Asymptotic Properties of Plug-in Estimator for a Linear Functional) Let X_i with i = 1, 2, . . . , n be a simple random sample from a population, which is characterized by the distribution function F. Suppose that \theta = S(F) = \int g(x)\, dF(x) is a linear functional and let \hat{\theta}_n = S(\hat{F}_n) be its nonparametric plug-in estimator. Then,

1. \sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\to} N(0, v^2), where v^2 = \int [g(x) - \theta]^2\, dF(x) is the asymptotic variance.

2. \hat{v}_n^2 \overset{p}{\to} v^2, where \hat{v}_n^2 = \frac{1}{n} \sum_{i=1}^{n} [g(X_i) - \hat{\theta}_n]^2.

3. \hat{\theta}_n \overset{p}{\to} \theta.
Now, let’s apply this theorem to the examples we saw earlier.

Example 6 Apply Theorem 6 to the nonparametric plug-in estimators derived in Example 5.

3.2 Parametric Estimation


In most social science research, parametric rather than nonparametric analysis is conducted. This is in part because parametric estimation is easier to understand and implement than nonparametric estimation, but it is also because the latter often requires a large amount of high-quality data.
A formal way to consider the distinction between parametric and nonparametric estimation is
to say that in the former, we take the data generating process to be characterized by a distribution
function F(·; θ) with an unknown parameter θ whose parameter space Θ is finite-dimensional.
This means that in parametric estimation, we only need to estimate θ; there is no need to estimate
the distribution function F itself (because θ completely characterizes F). One can imagine
that this simplifies the estimation problem significantly in many settings.
Here, we consider two general ways of conducting a parametric analysis using randomized
experiments as an example. The first is called the method of moments. This method is often
suboptimal, but its advantage is the ease of computation.

Definition 10 (Method of Moments) Let X_i with i = 1, 2, . . . , n be a simple random sample from a population whose distribution function is F(·; θ). The method of moments estimator, \hat{\theta}_n, of the J-dimensional parameter vector θ, is defined to be the value of θ which is a solution to the following system of J equations,

\frac{1}{n} \sum_{i=1}^{n} X_i^j = \int x^j\, dF(x; \theta),

for j = 1, 2, . . . , J.
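As a minimal numerical sketch of this definition (not from the notes), consider a single normal sample with θ = (µ, σ^2). Matching the first two sample moments to their population counterparts E(X) = µ and E(X^2) = σ^2 + µ^2 gives the estimators computed below:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(loc=1.0, scale=2.0, size=5000)  # simulated N(1, 4) data

# Method of moments for theta = (mu, sigma^2):
#   first moment:  E(X)   = mu
#   second moment: E(X^2) = sigma^2 + mu^2
m1 = np.mean(X)
m2 = np.mean(X ** 2)
mu_hat = m1
sigma2_hat = m2 - m1 ** 2

print(mu_hat, sigma2_hat)                      # should be close to 1 and 4
```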

Note that E(X_i^j) = \int x^j\, dF(x; \theta). Let's consider some examples.

Example 7 In the randomized classical experiment described in Theorem 1, construct the method
of moments estimator of the population average causal effect based on the following assumptions
about the marginal distributions of the potential outcomes.

1. Yi (1) ∼ N (µ1 , σ12 ) and Yi (0) ∼ N (µ2 , σ22 ).

2. Yi (1) ∼ Binom(k1 , π1 ) and Yi (0) ∼ Binom(k2 , π2 ).

In many cases, the method of moments estimator is poor and can be improved upon. But, the
method of moments estimator is consistent and asymptotically normal.

Theorem 7 (Asymptotic Properties of Method of Moments Estimator) Let \hat{\theta}_n be the method of moments estimator as defined in Definition 10. Let m^{(j)}(\theta) = \int x^j\, dF(x; \theta) represent the jth moment of a random variable X_i for j = 1, 2, . . . , J. Then, \hat{\theta}_n is a solution to the system of J equations, m(\theta) = \hat{m}_n, where \hat{m}_n^{(j)} is the jth sample moment.
Assume that \theta \in \Theta and \Theta \subset R^J is an open set. Further suppose that m : \Theta \mapsto R^J has a non-zero Jacobian at \theta_0 and is continuous and differentiable at \theta_0, where \theta_0 is the true value of \theta. Then,

1. (Existence) \hat{\theta}_n exists with probability tending to one.

2. (Consistency) \hat{\theta}_n \overset{p}{\to} \theta_0.

3. (Asymptotic Normality)

\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} N(0,\; g(\theta_0)\, var(Y_i)\, g(\theta_0)^\top),

where Y_i = (X_i, X_i^2, \ldots, X_i^J)^\top and g(\theta) = (g_1(\theta), g_2(\theta), \ldots, g_J(\theta))^\top with g_j(\theta) = \left[\frac{\partial}{\partial\theta} m_j(\theta)\right]^{-1}.

It is immediate that we can consistently estimate the asymptotic variance-covariance matrix by

\hat{v}_n = g(\hat{\theta}_n) \left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^\top \right] g(\hat{\theta}_n)^\top,

where \bar{Y} = \left( \sum_{i=1}^{n} X_i/n,\; \sum_{i=1}^{n} X_i^2/n,\; \ldots,\; \sum_{i=1}^{n} X_i^J/n \right)^\top. Then, it follows that in a sufficiently large sample, the method of moments estimator has the following approximate sampling distribution,

\hat{\theta}_n \overset{\text{approx.}}{\sim} N\!\left( \theta_0,\; \frac{\hat{v}_n}{n} \right).

Finally, this implies that each parameter, i.e., each element of \hat{\theta}_n, also has an approximately normal sampling distribution for sufficiently large n. Formally,

\hat{\theta}_n^{(i)} \overset{\text{approx.}}{\sim} N\!\left( \theta_0^{(i)},\; \frac{\hat{v}_n^{(i,i)}}{n} \right),

where the superscript denotes the ith element of a vector or the (i, i) element of a matrix. Note that \sqrt{\hat{v}_n^{(i,i)}/n} is called the estimated asymptotic standard error of \hat{\theta}_n^{(i)}. It is important to note that this derivation of the asymptotic standard error can be applied to any \sqrt{n}-consistent estimator. Let's derive the estimator for the asymptotic variance of the method of moments estimators.

Example 8 Derive the asymptotic variance of the method of moments estimators from Example 7
and consistent estimators of the resulting variances.
Another commonly used estimator maximizes the “likelihood” of observing the data you actu-
ally observed.

Definition 11 (Maximum Likelihood Estimator) Let X_i with i = 1, 2, . . . , n be a random sample from the population with probability density or mass function f(x | θ). Then, the likelihood function is defined as,

L(\theta \mid \mathbf{X}_n) = \prod_{i=1}^{n} f(X_i \mid \theta),

where \mathbf{X}_n = (X_1, X_2, \ldots, X_n). The log-likelihood function is given by l(\theta \mid \mathbf{X}_n) = \sum_{i=1}^{n} \log f(X_i \mid \theta). Finally, the maximum likelihood estimator is given by,

\hat{\theta}_n = \underset{\theta \in \Theta}{\arg\max}\; l(\theta \mid \mathbf{X}_n).
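In simple models the maximizer is available in closed form, but it is often found numerically. A sketch (not from the notes; simulated normal data, with the variance parameterized on the log scale so that the optimization is unconstrained, a device also suggested in Example 10 below) that maximizes the normal log-likelihood by minimizing its negative with scipy:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
X = rng.normal(loc=1.0, scale=2.0, size=2000)  # simulated N(1, 4) data

def neg_log_likelihood(params, data):
    """Negative normal log-likelihood; the variance is parameterized on the log scale."""
    mu, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)                # keeps the variance positive
    n = data.size
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((data - mu) ** 2) / (2 * sigma2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(X,))
mu_mle, sigma2_mle = result.x[0], np.exp(result.x[1])
print(mu_mle, sigma2_mle)                      # close to the sample mean and (1/n) sum (X - mean)^2
```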

Example 9 Find the maximum likelihood estimator in Example 7(1).


The maximum likelihood estimators have many desirable large-sample properties.

Theorem 8 (Asymptotic Properties of Maximum Likelihood Estimator) Consider the maximum likelihood estimator defined in Definition 11 and suppose that \theta_0 is the true value of \theta. Under certain regularity conditions,

1. (Consistency) \hat{\theta}_n \overset{p}{\to} \theta_0.

2. (Invariance) g(\hat{\theta}_n) is also the maximum likelihood estimator of g(\theta_0) for any function g.

3. (Asymptotic Normality)

\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} N(0,\; I(\theta_0)^{-1}),

where I(\theta_0) is the expected Fisher information, defined as E\!\left[ \frac{\partial}{\partial\theta} l(\theta \mid X_i)\big|_{\theta=\theta_0}\; \frac{\partial}{\partial\theta} l(\theta \mid X_i)\big|_{\theta=\theta_0}^{\top} \right].

4. (Asymptotic Efficiency) Let \hat{\theta}_n be any estimator of \theta. Then,

var(\hat{\theta}_n) \ge \left[ \frac{\partial}{\partial\theta} E(\hat{\theta}_n) \right]^{\top} \left\{ - E\!\left[ \frac{\partial^2}{\partial\theta\, \partial\theta^{\top}} l(\theta \mid X_i) \right] \bigg|_{\theta=\theta_0} \right\}^{-1} \left[ \frac{\partial}{\partial\theta} E(\hat{\theta}_n) \right].
Using the same argument as for the method of moments estimator, we can show that in a sufficiently large sample, the maximum likelihood estimator has the following approximate sampling distribution,

\hat{\theta}_n \overset{\text{approx.}}{\sim} N\!\left( \theta_0,\; \left[ - \frac{\partial^2}{\partial\theta\, \partial\theta^{\top}} l(\theta \mid \mathbf{X}_n) \Big|_{\theta=\hat{\theta}_n} \right]^{-1} \right),

where the variance is the inverse of the observed Fisher information matrix, i.e., minus the inverse of the Hessian of the log-likelihood evaluated at \hat{\theta}_n. To prove this, one also needs to show the information matrix equality,

E\!\left[ \frac{\partial}{\partial\theta} l(\theta \mid X_i)\Big|_{\theta=\theta_0} \left( \frac{\partial}{\partial\theta} l(\theta \mid X_i)\Big|_{\theta=\theta_0} \right)^{\top} \right] = - E\!\left[ \frac{\partial^2}{\partial\theta\, \partial\theta^{\top}} l(\theta \mid X_i) \Big|_{\theta=\theta_0} \right].

Example 10 Derive the asymptotic variance of the maximum likelihood estimator in the previous
example. Take the log transformation of the variance so that it is not bounded.

4 Interval Estimation
So far, we have used the variance as a measure of uncertainty about our estimates. However, we
also have seen that in many cases we know the asymptotic sampling distribution of our estimators.
Therefore, we can go a step further to produce another measure of uncertainty called a confidence
set (region), which covers the true value of the parameter with some probability.

Definition 12 (Confidence Sets) Let Xn = (X1 , X2 , . . . , Xn ) represent the data of sample size
n and θ0 be the true value of the parameter of interest, θ ∈ Θ. The (1 − α) confidence set is a set
C(Xn ), which satisfies the following equality,

\inf_{\theta_0 \in \Theta} P_{\theta_0}(\theta_0 \in C(\mathbf{X}_n)) = 1 - \alpha,

where 0 ≤ α ≤ 1 and P_{\theta_0}(\theta_0 \in C(\mathbf{X}_n)) is called the coverage probability.


If C(\mathbf{X}_n) is an interval, then it is called the (1 − α) confidence interval. One may also construct a (1 − α)
asymptotic confidence interval C(\mathbf{X}_n) such that \inf_{\theta_0 \in \Theta} \lim_{n\to\infty} P_{\theta_0}(\theta_0 \in C(\mathbf{X}_n)) = 1 - \alpha. The
interpretation of confidence sets requires caution.

• The (1 − α) confidence set is a set that contains the true value of the parameter with probability at least (1 − α).

• It is incorrect to say that the true value of the parameter lies in the (1 − α) confidence set obtained from a particular data set at least (1 − α) × 100 percent of the time.

Note that what is random is the confidence set and not the parameter, which is unknown but fixed.
The mean and variance alone do not give the confidence intervals. But, we can use an asymptotic
distribution of the estimator to come up with the confidence interval,

Theorem 9 (Normal-based Asymptotic Confidence Interval) Let \mathbf{X}_n = (X_1, X_2, \ldots, X_n) represent the data of sample size n and \theta_0 be the true value of the parameter of interest, \theta \in \Theta. Suppose that the asymptotic distribution of the estimator \hat{\theta}_n of the parameter \theta is given by,

\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} N(0, v^2).

Then, the (1 − α) asymptotic confidence interval is given by

C(\mathbf{X}_n) = (\hat{\theta}_n - z_{\alpha/2}\, \widehat{s.e.},\;\; \hat{\theta}_n + z_{\alpha/2}\, \widehat{s.e.}),

where z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2), \Phi(\cdot) is the distribution function of the standard normal random variable, and \widehat{s.e.} is the estimated standard error. This confidence interval satisfies the following property,

\lim_{n\to\infty} P(\theta_0 \in C(\mathbf{X}_n)) = 1 - \alpha.

Note that for α = 0.05, z_{\alpha/2} ≈ 1.96. Applying this theorem, one can immediately derive
confidence intervals for the nonparametric plug-in, method of moments, and maximum likelihood
estimators we studied earlier. Let's apply it to Neyman's estimator.

Example 11 (Confidence Intervals for the Population Average Causal Effect) Construct
the (1 − α) asymptotic confidence intervals for the estimator of the population average causal effect.
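One possible sketch of such a construction (not from the notes; assuming the setting of Theorems 3 and 4, with simulated data and α = 0.05):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

# Simulated completely randomized experiment with n treated and n control units
n = 200
Y_treat = rng.normal(1.0, 1.0, size=n)         # observed outcomes, treatment group
Y_control = rng.normal(0.0, 1.5, size=n)       # observed outcomes, control group

tau_hat = Y_treat.mean() - Y_control.mean()
se_hat = np.sqrt(Y_treat.var(ddof=1) / n + Y_control.var(ddof=1) / n)

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                    # z_{alpha/2}, about 1.96
ci = (tau_hat - z * se_hat, tau_hat + z * se_hat)
print(tau_hat, ci)
```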

5 Statistical Hypothesis Testing
5.1 General Concepts
Using Fisher’s randomization inference as a motivating example, we consider statistical hypothesis
tests more generally. Fisher used his p-value as a measure of evidence against the null hypothesis.
We can push his argument a bit further, and come up with a procedure which we use to reject or
retain the proposed null hypothesis.

Definition 13 (Hypothesis Test) Let θ be a parameter of interest and Θ be its parameter space.
Suppose that we wish to test the null hypothesis H0 : θ ∈ Θ0 against the alternative H1 : θ ∈ Θ \ Θ0 ,
using the test statistic S(Xn ) where Xn = (X1 , X2 , . . . , Xn ) represents the data of sample size n.
Then, the hypothesis test is defined by specifying the rejection region R: if S(\mathbf{X}_n) \in R, then
we reject H_0, and if S(\mathbf{X}_n) \notin R, then we retain H_0. Typically, the rejection region is defined as
R = (c, \infty), where c is called a critical value.
A null hypothesis of the form θ = θ_0 is called a simple null hypothesis, whereas hypotheses of the form θ > θ_0, θ < θ_0, etc. are called composite null hypotheses. While the choice of the null hypothesis should be based on one's scientific research question of interest (Fisher used τ_0 = 0), the choice of test statistic should be governed by its statistical properties. Fisher used S(\mathbf{Y}_n, \mathbf{T}_n) = \sum_{i=1}^{n} [T_i Y_i + (1 - T_i)(1 - Y_i)] as his test statistic, but other choices are also possible.
We can further investigate the consequence of choosing a particular test statistic on the perfor-
mance of the hypothesis testing procedure. In particular, when conducting a hypothesis test, one
can make two types of mistakes, which are called Type I error and Type II error,
Reject H0 Retain H0
H0 is true Type I error Correct
H0 is false Correct Type II error
Of course, we would like our hypothesis testing procedure to minimize those two types of errors.
But there is an inherent trade-off between the two errors. If you always reject H0 , then you
eliminate Type II error but maximize the possibility of Type I error. A way to get around this is to
minimize one type of error without increasing the other type. As in Fisher’s example, hypothesis
tests are conducted typically by deriving the distribution of test statistics under the null hypothesis.
Therefore, the probability of committing Type I error can be chosen by data analysts. Given this
fixed probability of Type I error, we may try to minimize the probability of committing Type II
error. To formalize this idea, we introduce the following concepts,

Definition 14 (Power and Size of Hypothesis Tests) Consider a hypothesis test defined by
the null hypothesis, H0 : θ ∈ Θ0 , the test statistic S(Xn ), and the rejection region R where Xn =
(X1 , X2 , . . . , Xn ) is the data of sample size n. Then,

1. The power function of the test is β(θ) = Pθ (S(Xn ) ∈ R).

2. The size of the test is α = supθ∈Θ0 β(θ).

3. A level α test is a hypothesis test whose size is less than or equal to α.

In words, the power of a test is the probability that one rejects the null, while the size of a test is
the largest probability of rejecting H0 when H0 is true.
According to the logic described above, we wish to find the test that has the largest power
under the alternative hypothesis among all size α tests. If one can find such a test, which is often
difficult to do, the test is called most powerful. If the test is most powerful against every value of
the parameter under the alternative hypothesis, then it is called uniformly most powerful. We do
not go into the details of how to find such tests, but the above discussion offers the following lesson
about the general interpretation of hypothesis testing,

• A failure to reject the null hypothesis may arise from the lack of power of your hypothesis
testing procedure rather than the fact that the null hypothesis is true.

Let’s make sure that we understand the concepts using the following simple example.

Example 12 (One sample t-test) Assume that X_i ∼ N(µ, σ^2) for i = 1, 2, . . . , n. In our causal
inference example, we may assume that the marginal distribution of each of the two potential
outcomes is Normal and test whether its mean is less than some specified value. Derive a level α
test and its power function for each of the following cases,

1. to test H_0 : µ = µ_0 against H_1 : µ ≠ µ_0.

2. to test H_0 : µ ≤ µ_0 against H_1 : µ > µ_0.
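A Monte Carlo sketch (not from the notes; a simplification that uses the z-test version of part 1 with known σ) approximates the power function by the rejection rate at several values of µ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)

n, sigma, mu0, alpha = 25, 1.0, 0.0, 0.05
z_crit = norm.ppf(1 - alpha / 2)

def rejection_rate(mu_true, n_sims=20000):
    """Fraction of simulated samples in which H0: mu = mu0 is rejected."""
    X = rng.normal(mu_true, sigma, size=(n_sims, n))
    z = (X.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

# At mu = mu0 the rejection rate approximates the size; elsewhere it approximates the power
for mu in [0.0, 0.2, 0.5]:
    print(mu, rejection_rate(mu))
```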

Finally, we give the general definition and interpretation of the p-value.

Definition 15 (p-value) Let \mathbf{X}_n = (X_1, X_2, \ldots, X_n) represent the data of sample size n. Consider a test statistic S(\mathbf{X}_n) and its observed value S(\mathbf{X}_n^{obs}) given the null hypothesis H_0 : \theta \in \Theta_0, where \Theta_0 is a subset of the parameter space \Theta of \theta. Then, the p-value, p(\mathbf{X}_n^{obs}), is equal to,

p(\mathbf{X}_n^{obs}) = \sup_{\theta \in \Theta_0} P_\theta\big(S(\mathbf{X}_n) \ge S(\mathbf{X}_n^{obs})\big).

If the null hypothesis is simple, i.e., \Theta_0 = \{\theta_0\}, then the p-value equals P_{\theta_0}(S(\mathbf{X}_n) \ge S(\mathbf{X}_n^{obs})). In general, one needs to be careful about the interpretation of the p-value.

• The p-value is the probability, computed under the null hypothesis, of observing a value of
the test statistic at least as extreme as the value actually observed.

• The p-value is not the probability that the null hypothesis is true.

• A large p-value can occur either because the null hypothesis is true or because the null
hypothesis is false but the test is not powerful.

• The statistical significance indicated by the p-value does not necessarily imply scientific sig-
nificance.

The first statement corresponds exactly to Fisher's formulation, and it is exactly what the definition
of the p-value says. We also note that the p-value is a function of the data and so can be
seen as a statistic. We can therefore derive the distribution of the p-value under the null hypothesis,

Theorem 10 (Distribution of the p-value) Consider a size α test, which is defined by the null
hypothesis H0 : θ ∈ Θ0 , the rejection region, Rα = (cα , ∞), and the test statistic, S(Xn ), where
Xn = (X1 , X2 , . . . , Xn ) represents the data of sample size n. Then, the distribution of the p-value
under the null is stochastically greater than or equal to Uniform(0,1). That is,

P (p(Xn ) ≤ α) ≤ α,

for α ∈ [0, 1].

An important special case is where the null hypothesis is simple and the test statistic is continuous.
In that case, the reference distribution of the p-value is Uniform(0,1). To understand the implication
of this theorem, consider rejecting/retaining the null hypothesis based on the p-value. That is, we
use the p-value itself as a test statistic and reject H0 when the p-value is less than or equal to α.
Then, the probability of committing Type I error is also less than or equal to α. (Why?) This
implies,

• the p-value is the smallest level at which we can reject the null hypothesis.

or more formally the p-value is equal to inf{α : S(Xn ) ∈ Rα }.

5.2 Finding Tests


There are many ways of finding a test. Fisher's randomization test is an example of a nonparametric
test, in which no distributional assumption is made. It is also an example of an exact test because no
approximation is made to derive the reference distribution of the test statistic. Other nonparametric
tests include the χ2 test of independence. Alternatively, one can consider parametric tests, in which
the reference distribution of the test statistic is derived based on some parametric assumptions. The
one-sample t-test we studied earlier is one such example. Let's review it in the context of
causal inference with randomized experiments, where it is often called the paired t-test.

Example 13 (Paired t-test) Let τ be the population average causal effect in a randomized ex-
periment with a matched-pair design where two units are paired on the basis of observed covariates
and the randomization of the treatment is conducted within each pair. Suppose that we wish to test
H0 : τ = τ0 against H1 : τ ≠ τ0 . Derive a level α test assuming the differences of the observed
outcomes between each pair of the treated and control units are normally distributed.
Another example of parametric tests is the two-sample t-test.

Example 14 (Two-sample t-test) Consider two independent random samples from different
Normal distributions, i.e., X_i ∼ N(µ_X, σ_X^2) for i = 1, . . . , n_X and Y_i ∼ N(µ_Y, σ_Y^2). Construct
the level α test for each of the following cases,

1. to test H_0 : µ_X ≤ µ_Y against H_1 : µ_X > µ_Y.

2. to test H_0 : µ_X = µ_Y against H_1 : µ_X ≠ µ_Y.

In the context of causal inference with randomized experiments, one may assume that independent
random samples are drawn from the population of the treated group and that of the control group.
These tests may give misleading results if the underlying distributional assumptions are
violated. One can easily check from the observed data whether the normality assumption is
reasonable, and there is a way to test this formally (e.g., the Kolmogorov-Smirnov test). Now, we
may wonder whether we can avoid this distributional assumption and conduct a hypothesis test by
using Neyman's nonparametric estimator. This is easy to do because we know the asymptotic
distribution of the estimator. In fact, the use of the asymptotic sampling distribution is a very general
way to construct a hypothesis test and works for both nonparametric and parametric tests.

Definition 16 (Wald Test) Consider a simple null hypothesis of the form, H_0 : θ = θ_0, against the alternative, H_1 : θ ≠ θ_0. Assume that an estimator \hat{\theta}_n is asymptotically normal,

\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} N(0, v^2),

where v^2 is the asymptotic variance. Then, the size α Wald test rejects H_0 if and only if

\left| \frac{\hat{\theta}_n - \theta_0}{\widehat{s.e.}} \right| > z_{\alpha/2},

where z_{\alpha} = \Phi^{-1}(1 - \alpha) with \Phi(\cdot) the standard normal distribution function, and \widehat{s.e.} is the estimated standard error, i.e., \sqrt{\hat{v}/n}, with \hat{v} an estimate of v^2.
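A minimal sketch of the Wald test in code (not from the notes; the one-sample mean with simulated data is my own choice of example):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)

# Test H0: mu = mu0 for the mean of a simple random sample
X = rng.normal(0.3, 1.0, size=150)
mu0 = 0.0

theta_hat = X.mean()
se_hat = X.std(ddof=1) / np.sqrt(X.size)       # estimated standard error of the sample mean

wald = (theta_hat - mu0) / se_hat
alpha = 0.05
reject = np.abs(wald) > norm.ppf(1 - alpha / 2)
p_value = 2 * (1 - norm.cdf(np.abs(wald)))     # two-sided p-value
print(wald, reject, p_value)
```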
We now apply the Wald test to causal inference with randomized experiments.

Example 15 (Wald Test Based on Neyman's Estimator) In the context of classical randomized experiments, construct a level α hypothesis test of the null hypothesis H_0 : τ = τ_0, where τ is the population average causal effect.
Note that Wald tests rely on the asymptotic sampling distribution of an estimator. This means
that Wald tests are valid level α tests only asymptotically,

Theorem 11 (Asymptotic Property of Wald Test) Consider the Wald test as defined in Definition 16. Then, under the null hypothesis,

\lim_{n\to\infty} P_{\theta_0}\!\left( \left| \frac{\hat{\theta}_n - \theta_0}{\widehat{s.e.}} \right| > z_{\alpha/2} \right) = \alpha,

where \widehat{s.e.} is the estimated asymptotic standard error of \hat{\theta}_n.
For likelihood inference, there is another way of constructing a hypothesis test by using the
likelihood itself rather than using the asymptotic sampling distribution of the ML estimator,

Definition 17 (Likelihood Ratio Test) Consider a hypothesis test where the null hypothesis H_0 : θ ∈ Θ_0 is tested against the alternative hypothesis H_1 : θ ∈ (Θ \ Θ_0). The likelihood ratio statistic is given by,

\lambda(\mathbf{X}_n) = 2 \log \frac{\sup_{\theta \in \Theta} L(\theta \mid \mathbf{X}_n)}{\sup_{\theta \in \Theta_0} L(\theta \mid \mathbf{X}_n)} = 2\,[\, l(\hat{\theta}_n \mid \mathbf{X}_n) - l(\tilde{\theta}_n \mid \mathbf{X}_n) \,],

where \mathbf{X}_n is the data of sample size n, and \hat{\theta}_n and \tilde{\theta}_n represent the unrestricted and restricted maximum likelihood estimates, respectively.
One can derive the asymptotic distribution of the likelihood ratio test statistic,

Theorem 12 (Asymptotic Distribution of Likelihood Ratio Test Statistic) Consider the likelihood ratio test defined in Definition 17, where \Theta_0 = \{\theta : \theta^{(i)} = \theta_0^{(i)} \text{ for some } i\}. Then, under the null hypothesis H_0 : \theta \in \Theta_0, we have

\lambda(\mathbf{X}_n) \overset{d}{\to} \chi^2_{\nu},

where \nu is the dimension of \Theta minus the dimension of \Theta_0. The p-value is given by P(Z \ge \lambda(\mathbf{X}_n)), where Z is distributed as \chi^2_{\nu}.
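A sketch of the likelihood ratio test in code (not from the notes; a normal model with known variance and the null H0: µ = 0 is my own choice, in which case the statistic is χ² with one degree of freedom):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)

# Normal model with known variance 1; test H0: mu = 0 against H1: mu != 0
n = 100
X = rng.normal(0.3, 1.0, size=n)

def log_lik(mu, data):
    """Normal log-likelihood with variance fixed at 1 (up to an additive constant)."""
    return -0.5 * np.sum((data - mu) ** 2)

mu_hat = X.mean()                              # unrestricted MLE
mu_tilde = 0.0                                 # restricted estimate under H0

lam = 2 * (log_lik(mu_hat, X) - log_lik(mu_tilde, X))
p_value = 1 - chi2.cdf(lam, df=1)              # nu = dim(Theta) - dim(Theta_0) = 1
print(lam, p_value)
```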
Finally, there is a close relationship between hypothesis testing and confidence intervals. Indeed,
one way to obtain a (1 − α) confidence set is to “invert” a level α test.

Theorem 13 (Inverting the Test) Consider a level α hypothesis test defined by the null hypothesis H_0 : θ = θ_0 and the rejection region R_{\theta_0}. Define the set C(\mathbf{X}) = \{\theta_0 : \mathbf{X} \notin R_{\theta_0}\}, where \mathbf{X} represents the data. Then, C(\mathbf{X}) is a (1 − α) confidence set. The converse also holds.
The theorem implies that randomization-based confidence sets can be constructed by inverting the
randomization test,

Example 16 (Randomization-Based Confidence Sets) Assume a constant treatment effect, i.e., Y_i(1) − Y_i(0) = τ for all i = 1, 2, . . . , n, where τ is a fixed but unknown parameter. How does one obtain a (1 − α) confidence set for τ?

References
Fisher, R. A. (1935). The Design of Experiments. Oliver and Boyd, London.

Ho, D. E. and Imai, K. (2006). Randomization inference with natural experiments: An analysis of ballot effects in the 2003 California recall election. Journal of the American Statistical Association 101, 475, 888–900.

Holland, P. W. (1986). Statistics and causal inference (with discussion). Journal of the American Statistical Association 81, 945–960.

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics 86, 1, 4–29.

Neyman, J. (1923). On the application of probability theory to agricultural experiments: Essay on principles, Section 9 (translated in 1990). Statistical Science 5, 465–480.
