ECONS303: Applied Quantitative Research Methods
Lecture set 2: Linear Regression with One
Regressor
“An equation contains the possibility of the state of affairs in reality. The logical structure of the economy which the model
shows, mathematics shows in equations. (When I speak of mathematics here, I include its sub-branch of statistics)”
- Choy, Keen Meng (2020) Tractatus Modellus-Philosophicus, mimeo
Outline
1. The population linear regression model
2. The ordinary least squares (OLS) estimator and the sample
regression line
3. Measures of fit of the sample regression
4. The least squares assumptions for causal inference
5. The sampling distribution of the OLS estimator
6. The least squares assumptions for prediction
Linear regression lets us estimate the
population regression line and its slope.
• The population regression line is the expected value of Y given
X.
• The slope is the difference in the expected values of Y, for two
values of X that differ by one unit
• The estimated regression can be used either for:
– causal inference (learning about the causal effect on Y of a change in
X)
– prediction (predicting the value of Y given X, for an observation not in
the data set)
The problem of statistical inference for linear regression is, at a general
level, the same as for estimation of the mean or of the differences between
two means. Statistical, or econometric, inference about the slope entails:
• Estimation:
– How should we draw a line through the data to estimate the population
slope?
Answer: ordinary least squares (OLS).
– What are the advantages and disadvantages of OLS?
• Hypothesis testing:
– How to test whether the slope is zero?
• Confidence intervals:
– How to construct a confidence interval for the slope?
The Linear Regression Model (SW Section 4.1)
The population regression line:
Test Score = β0 + β1STR
β1 = slope of population regression line
Why are β0 and β1 “population” parameters?
• We would like to know the population value of β1.
• We don’t know β1, so must estimate it using data.
The Population Linear Regression Model
Yi = β0 + β1Xi + ui, i = 1,…, n
• We have n observations, (Xi, Yi), i = 1,…, n.
• X is the independent variable or regressor
• Y is the dependent variable
• β0 = intercept
• β1 = slope of the population regression line (the coefficient on X)
• ui = the regression error
• The regression error consists of omitted factors: in general, these are factors other than the variable X that influence Y. The regression error also includes error in the measurement of Y.
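To make the population model concrete, here is a minimal simulation sketch in Python/numpy that generates a sample from Yi = β0 + β1Xi + ui. All parameter values and variable names are made up for illustration; they are not estimates from the textbook.

import numpy as np

# A simulated example of the population model Y_i = beta0 + beta1*X_i + u_i
rng = np.random.default_rng(0)
n = 420                                   # sample size (same order as the 420 CA districts)
beta0, beta1 = 700.0, -2.0                # hypothetical population intercept and slope
X = rng.uniform(14, 26, size=n)           # hypothetical student-teacher ratios
u = rng.normal(0, 15, size=n)             # regression error: omitted factors, E(u|X) = 0
Y = beta0 + beta1 * X + u                 # the population linear regression model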
The population regression model in a picture: Observations on Y and X
(n = 7); the population regression line; and the regression error (the
“error term”):
The Ordinary Least Squares Estimator
(SW Section 4.2)
How can we estimate β0 and β1 from data?
We use the least squares (“ordinary least squares” or “OLS”)
estimator of the unknown parameters β0 and β1. The OLS
estimator requires solving the following minimization problem:

min_{b0, b1} Σ_{i=1}^n [Yi − (b0 + b1 Xi)]²
The OLS estimator solves:  min_{b0, b1} Σ_{i=1}^n [Yi − (b0 + b1 Xi)]²
• The OLS estimator minimizes the average squared
difference between the actual values of Yi and the prediction
(“predicted value”) based on the estimated line.
• This minimization problem can be solved using calculus
(App. 4.2).
• The result is the OLS estimators of β0 and β1.
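As an aside (not from the textbook), the minimization can also be done numerically. A minimal Python sketch, using simulated data with made-up values and scipy's general-purpose minimizer, is:

import numpy as np
from scipy.optimize import minimize

# Simulated data (illustrative values only)
rng = np.random.default_rng(1)
n = 100
X = rng.uniform(14, 26, size=n)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=n)

# Sum of squared prediction mistakes as a function of the candidate line (b0, b1)
def ssr(b):
    return np.sum((Y - (b[0] + b[1] * X)) ** 2)

res = minimize(ssr, x0=[0.0, 0.0])        # numerical minimization over (b0, b1)
b0_hat, b1_hat = res.x                     # should match the closed-form OLS estimators
print(b0_hat, b1_hat)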
Example:
The population regression line: E(Test Score|STR) = β0 + β1STR
Mechanics of OLS
Key Concept 4.2: The OLS Estimator,
Predicted Values, and Residuals
The OLS estimators of the slope β1 and the intercept β0 are
β̂1 = [Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ)] / [Σ_{i=1}^n (Xi − X̄)²] = s_XY / s_X²      (4.7)

β̂0 = Ȳ − β̂1 X̄.      (4.8)

The OLS predicted values Ŷi and residuals ûi are

Ŷi = β̂0 + β̂1 Xi,  i = 1,…, n      (4.9)

ûi = Yi − Ŷi,  i = 1,…, n.      (4.10)

The estimated intercept (β̂0), slope (β̂1), and residuals (ûi) are computed from a sample of n observations of Xi and Yi, i = 1,…, n. These are estimates of the unknown true population intercept (β0), slope (β1), and error term (ui).
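A minimal Python/numpy sketch of equations (4.7)–(4.10) on simulated data (names and values are illustrative):

import numpy as np

# Simulated data (illustrative values only)
rng = np.random.default_rng(2)
n = 100
X = rng.uniform(14, 26, size=n)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=n)

# Equations (4.7) and (4.8): OLS slope and intercept
beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()

# Equations (4.9) and (4.10): predicted values and residuals
Y_hat = beta0_hat + beta1_hat * X
u_hat = Y - Y_hat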
Application to the California Test Score –
Class Size data
• Estimated slope: β̂1 = −2.28
• Estimated intercept: β̂0 = 698.9
• Estimated regression line: TestScore^ = 698.9 − 2.28 × STR
Interpretation of the estimated slope and
intercept
• TestScore^ = 698.9 − 2.28 × STR
• Districts with one more student per teacher on average have test scores that are 2.28 points lower.
• That is, ΔE(Test Score|STR)/ΔSTR = −2.28.
• The intercept (taken literally) means that, according to this estimated line,
districts with zero STR would have a (predicted) test score of 698.9. But this
interpretation of the intercept makes no sense – it extrapolates the line outside
the range of the data – here, the intercept is not economically meaningful.
Predicted values & residuals:
One of the districts in the data set is Antelope, CA, for
which STR = 19.33 and Test Score = 657.8
predicted value: Ŷ_Antelope = 698.9 − 2.28 × 19.33 = 654.8
residual: û_Antelope = 657.8 − 654.8 = 3.0
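A one-line arithmetic check of the Antelope example (Python, using the estimated coefficients reported above):

# Predicted value and residual for Antelope (STR = 19.33, Test Score = 657.8)
y_hat_antelope = 698.9 - 2.28 * 19.33     # about 654.8
u_hat_antelope = 657.8 - y_hat_antelope   # about 3.0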
OLS regression: STATA output
regress testscr str, robust
Regression with robust standard errors Number of obs = 420
F( 1, 418) = 19.26
Prob > F = 0.0000
R-squared = 0.0512
Root MSE = 18.581
-------------------------------------------------------------------------
| Robust
testscr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
str | -2.279808 .5194892 -4.39 0.000 -3.300945 -1.258671
_cons | 698.933 10.36436 67.44 0.000 678.5602 719.3057
-------------------------------------------------------------------------
TestScore^ = 698.9 − 2.28 × STR
(We’ll discuss the rest of this output later.)
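For readers working in Python rather than STATA, a roughly equivalent regression can be run with statsmodels; cov_type="HC1" requests heteroskedasticity-robust standard errors using the HC1 formula, which is what STATA's ", robust" option computes. The DataFrame below is a simulated stand-in for the California data, and the column names are illustrative:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the California data; in practice you would load the
# real data into a DataFrame with these (illustrative) column names.
rng = np.random.default_rng(3)
df = pd.DataFrame({"str_ratio": rng.uniform(14, 26, size=420)})
df["testscr"] = 699.0 - 2.3 * df["str_ratio"] + rng.normal(0, 18, size=420)

# OLS of test scores on the student-teacher ratio with robust (HC1) standard errors
fit = smf.ols("testscr ~ str_ratio", data=df).fit(cov_type="HC1")
print(fit.summary())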
Measures of Fit (SW Section 4.3)
Two regression statistics provide complementary measures of
how well the regression line “fits” or explains the data:
• The regression R2 measures the fraction of the variance of Y
that is explained by X; it is unitless and ranges between zero
(no fit) and one (perfect fit)
• The standard error of the regression (SER) measures the
magnitude of a typical regression residual in the units of Y.
The regression R2 is the fraction of the sample
variance of Yi “explained” by the regression.
Yi = Ŷi + ûi        (OLS prediction + OLS residual)

sample var(Yi) = sample var(Ŷi) + sample var(ûi)

total sum of squares (TSS) = “explained” SS (ESS) + “residual” SS

Definition of R²:

R² = ESS / TSS = [Σ_{i=1}^n (Ŷi − Ȳ)²] / [Σ_{i=1}^n (Yi − Ȳ)²]
• R2 = 0 means ESS = 0
• R2 = 1 means ESS = TSS
• 0 ≤ R2 ≤ 1
• For regression with a single X, R2 = the square of the correlation coefficient
between X and Y
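A minimal numpy sketch of the R² computation, including the single-regressor check that R² equals the squared correlation between X and Y (simulated data, illustrative values):

import numpy as np

# Simulated data (illustrative values only)
rng = np.random.default_rng(4)
n = 100
X = rng.uniform(14, 26, size=n)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=n)

# OLS fit
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

ESS = np.sum((Y_hat - Y.mean()) ** 2)     # explained sum of squares
TSS = np.sum((Y - Y.mean()) ** 2)         # total sum of squares
R2 = ESS / TSS

# With a single regressor, R2 equals the squared correlation between X and Y
R2_check = np.corrcoef(X, Y)[0, 1] ** 2
print(R2, R2_check)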
The Standard Error of the Regression (SER)
The SER measures the spread of the distribution of u. The SER is
(almost) the sample standard deviation of the OLS residuals:
SER = √[ (1/(n−2)) Σ_{i=1}^n (ûi − mean(û))² ] = √[ (1/(n−2)) Σ_{i=1}^n ûi² ]

The second equality holds because the mean of the OLS residuals is zero: (1/n) Σ_{i=1}^n ûi = 0. Thus

SER = √[ (1/(n−2)) Σ_{i=1}^n ûi² ]
The SER:
• has the units of u, which are the units of Y
• measures the average “size” of the OLS residual (the average “mistake” made
by the OLS regression line)
The root mean squared error (RMSE) is closely related to the
SER:
RMSE = √[ (1/n) Σ_{i=1}^n ûi² ]
This measures the same thing as the SER – the minor difference is division by n instead of n − 2.
Technical note: why divide by n – 2 instead
of n – 1?
SER = √[ (1/(n−2)) Σ_{i=1}^n ûi² ]

• Division by n − 2 is a “degrees of freedom” correction, just like division by n − 1 in the sample variance s_Y². The difference is that for the SER two parameters have been estimated (β0 and β1, by β̂0 and β̂1), whereas in s_Y² only one has been estimated (μ_Y, by Ȳ).
• When n is large, it doesn’t matter whether n, n − 1, or n − 2 is used – although the conventional formula uses n − 2 when there is a single regressor.
• For details, see Section 18.4
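A short numpy sketch of the SER and RMSE formulas, showing the n − 2 versus n divisors (simulated data, illustrative values):

import numpy as np

# Simulated data (illustrative values only)
rng = np.random.default_rng(5)
n = 100
X = rng.uniform(14, 26, size=n)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
u_hat = Y - (b0 + b1 * X)

SER = np.sqrt(np.sum(u_hat ** 2) / (n - 2))   # divides by n - 2 (two estimated parameters)
RMSE = np.sqrt(np.sum(u_hat ** 2) / n)        # divides by n; nearly identical for large n
print(SER, RMSE)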
The Least Squares Assumptions for
Causal Inference (SW Section 4.4)
• So far we have treated OLS as a way to draw a straight line
through the data on Y and X. Under what conditions does the
slope of this line have a causal interpretation? That is, when will
the OLS estimator be unbiased for the causal effect on Y of X?
• What is the variance of the OLS estimator over repeated
samples?
• To answer these questions, we need to make some assumptions
about how Y and X are related to each other, and about how they
are collected (the sampling scheme)
These assumptions – there are three – are known as the Least
Squares Assumptions for Causal Inference.
Definition of Causal Effect
• The causal effect on Y of a unit change in X is the expected
difference in Y as measured in a randomized controlled
experiment
– For a binary treatment, the causal effect is the expected difference in
means between the treatment and control groups, as discussed in Ch. 3.
• The least squares assumptions for causal inference generalize the
binary treatment case to regression.
The Least Squares Assumptions for Causal
Inference
Let β1 be the causal effect on Y of a change in X:
Yi = β0 + β1Xi + ui, i = 1,…, n
1. The conditional distribution of u given X has mean zero, that
is, E(u|X = x) = 0.
This implies that β̂1 is unbiased for the causal effect β1
2. (Xi,Yi), i = 1,…, n, are i.i.d.
– This is true if (X, Y) are collected by simple random sampling
This delivers the sampling distribution of β̂0 and β̂1
3. Large outliers in X and/or Y are rare.
– Technically, X and Y have finite fourth moments
Outliers can result in meaningless values of β̂1
Least squares assumption #1: E(u|X = x) = 0. (1 of 2)
When β1 is the causal effect, then for any given value of X, the mean of u is zero: E(u|X = x) = 0.
Example: Test Scorei = β0 + β1STRi + ui; ui = other factors
• What are some of these “other factors”?
Least squares assumption (LSA) #1:
E(u|X = x) = 0. (2 of 2)
• The benchmark for understanding this assumption is to consider an ideal
randomized controlled experiment:
• X is randomly assigned to people (students randomly assigned to different size
classes; patients randomly assigned to medical treatments). Randomization is
done by computer – using no information about the individual.
• Because X is assigned randomly, all other individual characteristics – the
things that make up u – are distributed independently of X, so u and X are
independent
• Thus, in an ideal randomized controlled experiment, E(u|X = x) = 0 (that is,
LSA #1 holds)
• In actual experiments, or with observational data, we will need to think hard
about whether E(u|X = x) = 0 holds.
Least squares assumption #2: (Xi,Yi),
i = 1,…,n are i.i.d.
This arises automatically if the entity (individual, district) is
sampled by simple random sampling:
• The entities are selected from the same population, so (Xi, Yi) are
identically distributed for all i = 1,…, n.
• The entities are selected at random, so the values of (X, Y) for different
entities are independently distributed.
Note: The main place we will encounter non-i.i.d. sampling is when data are
recorded over time for the same entity (panel data and time series data).
* We will deal with that complication when we cover panel data.
Least squares assumption #3: Large outliers are rare
Technical statement: E(X⁴) < ∞ and E(Y⁴) < ∞
This is because OLS can be sensitive to an outlier:
• A large outlier is an extreme value of X or Y.
• Is the lone point an outlier in X or Y?
• In practice, outliers are often data glitches (coding or recording problems).
Sometimes they are observations that really shouldn’t be in your data set.
Plot your data!
Least squares assumption #3: Large outliers are rare
Technical statement: E(X⁴) < ∞ and E(Y⁴) < ∞
• On a technical level, if X and Y are bounded, then they have
finite fourth moments. (Standardized test scores automatically
satisfy this; STR, family income, etc. satisfy this too.)
• The substance of this assumption is that a large outlier can
strongly influence the results – so we need to rule out large
outliers.
Before conducting any formal analysis, look at your data!
If you have a large outlier, is it a typo? Does it belong in your
data set? Why is it an outlier?
The Sampling Distribution of the OLS
Estimator (SW Section 4.5)
The OLS estimator is computed from a sample of data. A different sample yields a different value of β̂1. This is the source of the “sampling uncertainty” of β̂1. We want to:
• quantify the sampling uncertainty associated with β̂1
• use β̂1 to test hypotheses such as β1 = 0
• construct a confidence interval for β1
• All these require figuring out the sampling distribution of the
OLS estimator. Two steps to get there…
– Probability framework for linear regression
– Distribution of the OLS estimator
Probability Framework for Linear Regression
The probability framework for linear regression is summarized by the
three least squares assumptions.
Population
• The group of interest (e.g.: all possible school districts)
Random variables: Y, X
• E.g.: (Test Score, STR)
Joint distribution of (Y, X). We assume:
• The population regression function is linear
• E(u| X) = 0 (1st Least Squares Assumption (LSA))
• X, Y have nonzero finite fourth moments (3rd L.S.A.)
Data Collection by simple random sampling implies:
• {(Xi, Yi)}, i = 1,…, n, are i.i.d. (2nd L.S.A.)
The Sampling Distribution of β̂1
• Like Ȳ, β̂1 has a sampling distribution.
• What is E(β̂1)?
  If E(β̂1) = β1, then OLS is unbiased, which is a good thing!
• What is var(β̂1)? (a measure of sampling uncertainty)
  We need to derive a formula so we can compute the standard error of β̂1.
• What is the distribution of β̂1 in small samples?
  It is very complicated in general.
  However, in large samples, β̂1 is normally distributed.
The mean and variance of the sampling distribution of β̂1

Yi = β0 + β1Xi + ui
Ȳ = β0 + β1X̄ + ū

so Yi − Ȳ = β1(Xi − X̄) + (ui − ū)

Thus,

β̂1 = [Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ)] / [Σ_{i=1}^n (Xi − X̄)²]
    = [Σ_{i=1}^n (Xi − X̄)[β1(Xi − X̄) + (ui − ū)]] / [Σ_{i=1}^n (Xi − X̄)²]
The mean and variance of the sampling distribution of β̂1 (continued)

After some algebraic manipulation (see the Appendix at the end of the slides), we obtain

β̂1 = β1 + [Σ_{i=1}^n (Xi − X̄)ui] / [Σ_{i=1}^n (Xi − X̄)²]

Now we can calculate E(β̂1) and var(β̂1):

E(β̂1) = β1 + E{ [Σ_{i=1}^n (Xi − X̄)ui] / [Σ_{i=1}^n (Xi − X̄)²] }
       = β1 + E{ E( [Σ_{i=1}^n (Xi − X̄)ui] / [Σ_{i=1}^n (Xi − X̄)²] | X1,…, Xn ) }
       = β1

The second line uses the law of iterated expectations: the expected value of a random variable equals the expectation of its conditional expectation given a second random variable. The inner conditional expectation is zero because E(ui|Xi = x) = 0 by LSA #1.
• Thus LSA #1 implies that E(β̂1) = β1.
• That is, β̂1 is an unbiased estimator of β1.
• For details see App. 4.3.
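A Monte Carlo sketch of the unbiasedness result: in a simulated DGP where E(u|X) = 0 holds by construction, the average of β̂1 across many samples should be close to β1. All values below are illustrative.

import numpy as np

rng = np.random.default_rng(6)
beta1 = -2.0
n, reps = 100, 5000
estimates = np.empty(reps)

for r in range(reps):
    X = rng.uniform(14, 26, size=n)
    u = rng.normal(0, 15, size=n)            # E(u|X) = 0 holds by construction
    Y = 700.0 + beta1 * X + u
    estimates[r] = (np.sum((X - X.mean()) * (Y - Y.mean()))
                    / np.sum((X - X.mean()) ** 2))

print(estimates.mean())                       # should be close to beta1 = -2.0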
Next calculate var(β̂1) (1 of 2)

Write

β̂1 − β1 = [ (1/n) Σ_{i=1}^n vi ] / [ ((n − 1)/n) s_X² ]

where vi = (Xi − X̄)ui. If n is large, then s_X² ≈ σ_X² and (n − 1)/n ≈ 1, so

β̂1 − β1 ≈ [ (1/n) Σ_{i=1}^n vi ] / σ_X²

(see App. 4.3). Thus,
Next calculate var(β̂1) (2 of 2)

β̂1 − β1 ≈ [ (1/n) Σ_{i=1}^n vi ] / σ_X²

so var(β̂1 − β1) = var(β̂1) = var[ (1/n) Σ_{i=1}^n vi ] / (σ_X²)² = [ var(vi)/n ] / (σ_X²)²,

where the final equality uses assumption 2 (the observations are i.i.d.). Thus,

var(β̂1) = (1/n) × var[(Xi − μ_X)ui] / (σ_X²)².
Summary so far
1. β̂1 is unbiased: under LSA #1, E(β̂1) = β1 (just like Ȳ)
2. var(β̂1) is inversely proportional to n (just like Ȳ)
What is the sampling distribution of β̂1?
The exact sampling distribution is complicated – it depends on the population distribution of (Y, X) – but when n is large we get some simple (and good) approximations:
1) Because var(β̂1) ∝ 1/n and E(β̂1) = β1, β̂1 converges in probability to β1: β̂1 →p β1
2) When n is large, the sampling distribution of β̂1 is well approximated by a normal distribution (CLT)
Large-n approximation to the distribution of β̂1:

β̂1 − β1 = [ (1/n) Σ_{i=1}^n vi ] / [ ((n − 1)/n) s_X² ] ≈ [ (1/n) Σ_{i=1}^n vi ] / σ_X²,  where vi = (Xi − X̄)ui

• When n is large, vi = (Xi − X̄)ui ≈ (Xi − μ_X)ui, which is i.i.d. with variance σ_v² = var[(Xi − μ_X)ui].
• So, by the CLT, (1/n) Σ_{i=1}^n vi is approximately distributed N(0, σ_v²/n).
• Thus, for n large, β̂1 is approximately distributed

β̂1 ~ N( β1, σ_v² / [n(σ_X²)²] ),  where vi = (Xi − μ_X)ui
The larger the variance of X, the smaller the variance of β̂1

The math

var(β̂1 − β1) = (1/n) × var[(Xi − μ_X)ui] / (σ_X²)²

where σ_X² = var(Xi). The variance of X appears (squared) in the denominator, so increasing the spread of X decreases the variance of β̂1.
The intuition
If there is more variation in X, then there is more information in the data that you can use to fit the regression line. This is most easily seen in a figure; the simulation sketch below makes the same point numerically.
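A simulation sketch of this intuition (illustrative values): the sampling variance of β̂1 is computed for a narrow and a wide spread of X.

import numpy as np

def var_of_ols_slope(x_low, x_high, n=100, reps=5000, seed=7):
    # Monte Carlo variance of the OLS slope when X ~ Uniform(x_low, x_high)
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(x_low, x_high, size=n)
        Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=n)
        est[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    return est.var()

print(var_of_ols_slope(19, 21))   # little variation in X: larger var(beta1_hat)
print(var_of_ols_slope(14, 26))   # more variation in X: smaller var(beta1_hat)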
Summary of the sampling distribution of β̂1:
If the three Least Squares Assumptions hold, then
• The exact (finite-sample) sampling distribution of β̂1 has:
  – E(β̂1) = β1 (that is, β̂1 is unbiased)
  – var(β̂1) = (1/n) × var[(Xi − μ_X)ui] / σ_X⁴, which is proportional to 1/n
• Other than its mean and variance, the exact distribution of β̂1 is complicated and depends on the distribution of (X, u)
• β̂1 →p β1 (that is, β̂1 is consistent)
• When n is large, [β̂1 − E(β̂1)] / √var(β̂1) ~ N(0, 1) (CLT)
Note: This is similar to the sampling distribution of Ȳ.
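A sketch that checks this summary numerically (all values illustrative): the Monte Carlo variance of β̂1 is compared with (1/n) × var[(Xi − μ_X)ui] / σ_X⁴, and the standardized estimates should look roughly N(0, 1).

import numpy as np

rng = np.random.default_rng(8)
n, reps = 200, 5000
mu_X, half_width = 20.0, 6.0                  # X ~ Uniform(14, 26)
sigma2_X = (2 * half_width) ** 2 / 12         # variance of that uniform distribution
sigma_u = 15.0
est = np.empty(reps)

for r in range(reps):
    X = rng.uniform(mu_X - half_width, mu_X + half_width, size=n)
    u = rng.normal(0, sigma_u, size=n)
    Y = 700.0 - 2.0 * X + u
    est[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)

# Asymptotic variance (1/n) * var[(X - mu_X)u] / sigma_X^4; here u is independent of X
# with mean zero, so var[(X - mu_X)u] = var(X) * var(u).
asym_var = (sigma2_X * sigma_u ** 2) / (n * sigma2_X ** 2)
print(est.var(), asym_var)                    # should be close

z = (est - (-2.0)) / np.sqrt(asym_var)        # standardized estimates
print(z.mean(), z.std())                      # roughly 0 and 1 (CLT)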
Key Concept 4.4: Large-Sample Distributions of β̂0 and β̂1

If the least squares assumptions in Key Concept 4.3 hold, then in large samples β̂0 and β̂1 have a jointly normal sampling distribution.

The large-sample normal distribution of the slope β̂1 is N(β1, σ²_β̂1), where the variance of this distribution, σ²_β̂1, is

σ²_β̂1 = (1/n) × var[(Xi − μ_X)ui] / [var(Xi)]²      (4.21)

The large-sample normal distribution of the intercept β̂0 is N(β0, σ²_β̂0), where

σ²_β̂0 = (1/n) × var(Hi ui) / [E(Hi²)]²,  where Hi = 1 − [μ_X / E(Xi²)] Xi.      (4.22)
The Least Squares Assumptions for Prediction
(SW Appendix 4.4) (1 of 2)
• Prediction entails using an estimation sample to estimate a
prediction model, then using that model to predict the value
of Y for an observation not in the estimation sample.
– Prediction requires good out-of-sample performance.
• For prediction, β1 is simply the slope of the population
regression line (the conditional expectation of Y given X),
which in general is not the causal effect.
• The critical LSA for Prediction is that the out-of-sample
(“OOS”) observation for which you want to predict Y comes
from the same distribution as the data used to estimate the
model.
– This replaces LSA#1 for Causal Inference
The Least Squares Assumptions for Prediction
(SW Appendix 4.4) (2 of 2)
1. The out of sample observation (XOOS,YOOS) is drawn from
the same distribution as the estimation sample (Xi,Yi), i =
1,…,n
– This ensures that the regression line fit using the estimation sample
also applies to the out-of-sample data to be predicted.
2. (Xi,Yi), i = 1,…, n are i.i.d.
– This is the same as LSA#2 for causal inference
3. Large outliers in X and/or Y are rare (X and Y have finite
fourth moments)
– This is the same as LSA#3 for causal inference
* In this book, the assumption that large outliers are unlikely is made mathematically
precise by assuming that X and Y have nonzero finite fourth moments.
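A minimal prediction sketch under these assumptions: the line is fit on an estimation sample and then used to predict Y for an out-of-sample observation drawn from the same distribution. All data are simulated and the names are illustrative.

import numpy as np

rng = np.random.default_rng(9)

# Estimation sample
n = 300
X = rng.uniform(14, 26, size=n)
Y = 700.0 - 2.0 * X + rng.normal(0, 15, size=n)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

# Out-of-sample observation drawn from the same distribution (prediction assumption #1)
X_oos = rng.uniform(14, 26)
Y_oos = 700.0 - 2.0 * X_oos + rng.normal(0, 15)
Y_oos_pred = b0 + b1 * X_oos                  # predicted value for the OOS observation
print(Y_oos_pred, Y_oos)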
APPENDIX
The mean and variance of the sampling distribution of β̂1 (1 of 3)

Some algebra:

Yi = β0 + β1Xi + ui
Ȳ = β0 + β1X̄ + ū

so Yi − Ȳ = β1(Xi − X̄) + (ui − ū)

Thus,

β̂1 = [Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ)] / [Σ_{i=1}^n (Xi − X̄)²]
    = [Σ_{i=1}^n (Xi − X̄)[β1(Xi − X̄) + (ui − ū)]] / [Σ_{i=1}^n (Xi − X̄)²]
The mean and variance of the sampling distribution of β̂1 (2 of 3)

β̂1 = β1 × [Σ_{i=1}^n (Xi − X̄)(Xi − X̄)] / [Σ_{i=1}^n (Xi − X̄)²] + [Σ_{i=1}^n (Xi − X̄)(ui − ū)] / [Σ_{i=1}^n (Xi − X̄)²]

so

β̂1 − β1 = [Σ_{i=1}^n (Xi − X̄)(ui − ū)] / [Σ_{i=1}^n (Xi − X̄)²].

Now

Σ_{i=1}^n (Xi − X̄)(ui − ū) = Σ_{i=1}^n (Xi − X̄)ui − [Σ_{i=1}^n (Xi − X̄)] ū
                            = Σ_{i=1}^n (Xi − X̄)ui − [Σ_{i=1}^n Xi − nX̄] ū
                            = Σ_{i=1}^n (Xi − X̄)ui,

because Σ_{i=1}^n Xi − nX̄ = 0.
The mean and variance of the sampling distribution of β̂1 (3 of 3)

Substitute Σ_{i=1}^n (Xi − X̄)(ui − ū) = Σ_{i=1}^n (Xi − X̄)ui into the expression for β̂1 − β1:

β̂1 − β1 = [Σ_{i=1}^n (Xi − X̄)(ui − ū)] / [Σ_{i=1}^n (Xi − X̄)²]

so

β̂1 − β1 = [Σ_{i=1}^n (Xi − X̄)ui] / [Σ_{i=1}^n (Xi − X̄)²]