Chapter 2
An Overview of the Classical Linear Regression Model
Instructor: Badassa Wolteji (PhD)
Regression
Regression is probably the single most important tool at
the econometrician’s disposal.
But what is regression analysis?
It is concerned with describing and evaluating the
relationship between a given variable (dependent variable)
and one or more other variables (independent variable(s)).
Terminology and notation
Denote the dependent variable by y and the
independent variable(s) by x1, x2,..., xk where there are
k independent variables.
Note that there can be many x variables but we will
limit ourselves to the case where there is only one x
variable to start with.
In our set-up, there is only one y variable.
We later include more x’s – the multiple regression
case
Terminology and notation
In the literature the terms dependent variable and
explanatory variable are described variously. A
representative list is:
Dependent variable ≡ explained variable ≡ regressand ≡ response variable
Independent variable ≡ explanatory variable ≡ regressor ≡ predictor
Regression is different from Correlation
If we say y and x are correlated, it means that we are
treating y and x in a completely symmetrical way.
In regression, we treat the dependent variable (y) and
the independent variable(s) (x’s) very differently.
The y variable is assumed to be random or
“stochastic” i.e. it has a probability distribution.
The x variables are, however, assumed to have fixed
(“non-stochastic”) values in repeated samples.
Regression versus correlation
Although regression analysis deals with the
dependence of one variable on other variables, it
does not necessarily imply causation.
For example, in a regression of crop yield on
rainfall, there is no statistical reason to assume that
rainfall does not depend on crop yield.
Regression versus correlation
The fact that we treat crop yield as dependent on
rainfall (among other things) is due to non-
statistical considerations.
Common sense suggests that the relationship
cannot be reversed, for we cannot control rainfall
by varying crop yield.
A statistical relationship in itself cannot logically
imply causation.
To ascribe causality, one must appeal to a priori
or theoretical considerations.
Regression versus correlation
In correlation analysis, the primary objective is to
measure the strength or degree of linear association
between two variables.
In regression analysis, we try to estimate or predict
the average value of one variable on the basis of the
fixed values of other variables.
Regression and correlation have some fundamental
differences.
In regression analysis there is an asymmetry in the
way the dependent and explanatory variables are
treated.
Regression versus correlation
In correlation analysis, we treat the two variables
symmetrically; there is no distinction between the
dependent and explanatory variables.
In contrast, most of the regression theory to be dealt
with here is conditional upon the assumption that the
dependent variable is stochastic while the explanatory
variables are fixed or non-stochastic.
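This asymmetry is easy to see in Stata. A minimal sketch using Stata's built-in auto dataset (a stand-in, not the chapter's data): the correlation is the same whichever way round the variables are taken, but the two possible regressions give different slopes.

* Correlation is symmetric; regression is not (auto data as a stand-in)
sysuse auto, clear
correlate price mpg     // corr(price, mpg) = corr(mpg, price)
regress price mpg       // slope from regressing price on mpg
regress mpg price       // a different slope: the roles of y and x matter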
Simple Regression
For simplicity, say k=1. This is the situation where y
depends on only one x variable.
Examples of the kind of relationship that may be of
interest include:
How labour productivity varies with training types
Measuring the long-term relationship between crop
yield and fertilizer use
The Simple Regression Model
Definition of the simple linear regression model:
y = α + βx + u
The model explains the variable y in terms of the variable x. Here α is the
intercept and β is the slope parameter.
y is the dependent variable (also called the explained variable, response
variable, regressand, …).
x is the independent variable (also called the explanatory variable,
regressor, …).
u is the error term (also called the disturbance, the unobservables, …).
The Simple Regression Model
Interpretation of the simple linear regression model: it studies how y
varies with changes in x:
Δy = βΔx as long as Δu = 0
By how much does the dependent variable change if the independent
variable is increased by one unit? This interpretation is only correct if all
other things remain equal when the independent variable is increased by
one unit.
The simple linear regression model is rarely applicable in practice, but
its discussion is useful for pedagogical reasons.
The Simple Regression Model
Example: soybean yield and fertilizer
yield = α + β·fertilizer + u
The error term u contains factors such as rainfall, land quality,
presence of parasites, … The coefficient β measures the effect of
fertilizer on yield, holding all other factors fixed.
Example: a simple wage equation
wage = α + β·educ + u
Here u contains labour force experience, tenure with the current
employer, work ethic, intelligence, … The coefficient β measures the
change in hourly wage given another year of education, holding all
other factors fixed.
Simple Regression: An Example
Suppose that we have the following data on productivity and
primary enrollment for a country.
Year   Agriculture_VA_lab   Agri_land_hect   Agri_population   enroll_primary (% gross)
1985   139                  57690000         38323000          38
1986   157                  57385000         39421000          38
1987   178                  57030000         40552000          41
1988   169                  56775000         41719000          42
1989   164                  56520000         42925000          41
1990   168                  56312000         44173000          37
1991   167                  56158000         45444000          33
1992   160                  56105000         46741000          26
1993   172                  30540000         45536000          23
1994   161                  30472000         46844000          27
1995   161                  30500000         48118000          31
1996   183                  30500000         49347000          37
1997   181                  30492000         50536000          42
1998   158                  30508000         51693000          51
1999   159                  30676000         52831000          50
2000   159                  30662000         53957000          55
2001   168                  31409000         55075000          60
2002   160                  30604000         56179000          63
2003   139                  31607000         57265000          65
2004   158                  33101000         58326000          69
2005   174                  33691000         59358000          81
2006   190                  34219000         60362000          87
2007   201                  35077000         61342000          95
2008   211                  34513000         62294000          102
2009   219                  .                63231000          102
2010   226                  .                64158000          102
2011   235                  .                65076000          .
Simple Regression: An Example
We have some intuition that the beta on primary school
gross enrollment is positive, and we therefore want to
find whether there appears to be a relationship between x
and y given the data that we have.
The first stage would be to form a scatter plot of the two
variables.
We can do this in Stata:
scatter agriculture_va_lab enroll_primarygross
Graph (Scatter Diagram)
[Scatter plot: agriculture_va_lab on the vertical axis (140 to 240) against
enroll_primary (% gross) on the horizontal axis (20 to 100)]
Finding a Line of Best Fit
We can use the general equation for a straight line,
y=a+bx
to get the line that best “fits” the data.
However, this equation (y=a+bx) is completely
deterministic.
Is this realistic? No. So what we do is to add a random
disturbance term, u, into the equation:
y_t = α + βx_t + u_t
where t = 1985, 1986, 1987, …, 2011.
Why do we include a Disturbance term?
The disturbance term can capture a number of
features:
We always leave out some determinants of yt
There may be errors in the measurement of yt that
cannot be modelled.
Random outside influences on yt which we cannot
model
Determining the Regression Coefficients
So how do we determine what α and β are?
Choose α̂ and β̂ so that the (vertical) distances from the data points to the
fitted line are minimised (so that the line fits the data as closely as
possible):
[Diagram: scatter of data points with a fitted straight line and the vertical
distances from each point to the line]
Ordinary Least Squares
The most common method used to fit a line to the data is known as
OLS (ordinary least squares).
What we actually do is to take each distance and square it (i.e. take the
area of each of the squares in the diagram) and minimise the total sum
of the squares (hence least squares).
Tightening up the notation, let
y_t denote the actual data point t
ŷ_t denote the fitted value from the regression line
û_t denote the residual, û_t = y_t − ŷ_t
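After any regression, Stata constructs the fitted values and residuals just defined with predict. A minimal sketch, again using the built-in auto dataset as a stand-in for the chapter's data:

* Fitted values and residuals after OLS (auto data as a stand-in)
sysuse auto, clear
regress price mpg
predict yhat                 // fitted values yhat_t from the regression line
predict uhat, residuals      // residuals uhat_t = price - yhat
summarize uhat               // mean of OLS residuals is zero by construction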
Actual and Fitted Value
[Diagram: for a given x_i, the actual value y_i, the fitted value ŷ_i on the
regression line, and the residual û_i = y_i − ŷ_i]
How OLS Works
With, say, 5 observations, we minimise û₁² + û₂² + û₃² + û₄² + û₅², i.e.
minimise Σ_{t=1..5} û_t². This is known as the residual sum of squares (RSS).
But what was û_t? It was the difference between the actual point and the
line, y_t − ŷ_t.
So minimising Σ_t (y_t − ŷ_t)² is equivalent to minimising Σ_t û_t²
with respect to α̂ and β̂.
Deriving the OLS Estimator
But ŷ_t = α̂ + β̂x_t, so let
L = Σ_t (y_t − ŷ_t)² = Σ_t (y_t − α̂ − β̂x_t)²
We want to minimise L with respect to (w.r.t.) α̂ and β̂, so differentiate L
w.r.t. α̂ and w.r.t. β̂:
∂L/∂α̂ = −2 Σ_t (y_t − α̂ − β̂x_t) = 0   (1)
∂L/∂β̂ = −2 Σ_t x_t (y_t − α̂ − β̂x_t) = 0   (2)
From (1), Σ_t (y_t − α̂ − β̂x_t) = 0  ⇒  Σ y_t − Tα̂ − β̂ Σ x_t = 0
But Σ y_t = Tȳ and Σ x_t = Tx̄.
Deriving the OLS Estimator (cont'd)
So we can write Tȳ − Tα̂ − Tβ̂x̄ = 0 or ȳ − α̂ − β̂x̄ = 0   (3)
From (2), Σ_t x_t (y_t − α̂ − β̂x_t) = 0   (4)
From (3), α̂ = ȳ − β̂x̄   (5)
Substituting into (4) for α̂ from (5):
Σ_t x_t (y_t − ȳ + β̂x̄ − β̂x_t) = 0
Σ x_t y_t − ȳ Σ x_t + β̂x̄ Σ x_t − β̂ Σ x_t² = 0
Σ x_t y_t − Tx̄ȳ + β̂Tx̄² − β̂ Σ x_t² = 0
Deriving the OLS Estimator (cont’d)
Solving for, $ ˆ (Tx 2 xt2 ) Tyx xt yt
ˆ
xt yt Txy
and ˆ y ˆx
So overall we have xt2 Tx 2
This method of finding the optimum is known as ordinary least
squares(OLS).
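As a cross-check, these formulae can be applied by hand. A minimal Stata sketch (built-in auto dataset as a stand-in; the variable and scalar names are ours) computes β̂ and α̂ from the sums and compares them with regress:

* OLS slope and intercept from the formulae above (auto data as a stand-in)
sysuse auto, clear
quietly summarize mpg
scalar xbar = r(mean)        // x-bar
scalar T = r(N)              // number of observations
quietly summarize price
scalar ybar = r(mean)        // y-bar
gen xy = mpg*price
quietly summarize xy
scalar sxy = r(sum)          // sum of x_t*y_t
gen xsq = mpg^2
quietly summarize xsq
scalar sxx = r(sum)          // sum of x_t^2
scalar bhat = (sxy - T*xbar*ybar) / (sxx - T*xbar^2)
scalar ahat = ybar - bhat*xbar
display "bhat = " bhat "   ahat = " ahat
regress price mpg            // coefficients should match bhat and ahat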
reg agriculture_va_lab enroll_primarygross

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  1,    24) =   21.91
       Model |  5651.68212     1  5651.68212           Prob > F      =  0.0001
    Residual |  6190.47173    24  257.936322           R-squared     =  0.4773
             |                                         Adj R-squared =  0.4555
       Total |  11842.1538    25  473.686154           Root MSE      =   16.06

------------------------------------------------------------------------------------
 agriculture_va_lab |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+---------------------------------------------------------------
enroll_primarygross |   .5940433    .126907     4.68   0.000     .3321202    .8559664
              _cons |   139.5295   7.693246    18.14   0.000     123.6514    155.4075
------------------------------------------------------------------------------------
What Do We Use α̂ and β̂ For?
In the example used above, plugging the 26
observations into the formulae given above
would lead to the estimates
α̂ = 139.5295 and β̂ = 0.594.
We would write the fitted line as ŷ = 139.53 + 0.594x,
where y is agricultural labour productivity and x is
gross primary enrollment (%).
Question: if gross enrollment increases by 1 percentage
point, what will happen to agricultural labour productivity?
Accuracy of Intercept Estimate
Care needs to be exercised when considering the
intercept estimate, particularly if there are no or
few observations close to the y-axis:
[Diagram: data points clustered far from the y-axis, so the fitted line must
be extrapolated a long way back to x = 0 to obtain the intercept estimate]
The Population and the Sample
The population is the total collection of all objects or
people to be studied.
A sample is a selection of just some items from the
population.
A random sample is a sample in which each
individual item in the population is equally likely to
be drawn.
The Data Generating Process (DGP) and the PRF
The population regression function (PRF) is a description of the
model that is thought to be generating the actual data and the true
relationship between the variables (i.e. the true values of α and β).
The PRF is y_t = α + βx_t + u_t
The sample regression function (SRF) is ŷ_t = α̂ + β̂x_t,
and we also know that û_t = y_t − ŷ_t.
We use the SRF to infer likely values of the PRF.
We also want to know how "good" our estimates of α and β are.
Linearity
In order to use OLS, we need a model which is linear in the parameters (α
and β). It does not necessarily have to be linear in the variables (y and x).
Linear in the parameters means that the parameters are not multiplied
together, divided, squared or cubed, etc.
Some models can be transformed to linear ones by a suitable substitution
or manipulation, e.g. the exponential regression model
Y_t = e^α X_t^β e^{u_t}  ⇔  ln Y_t = α + β ln X_t + u_t
Then let y_t = ln Y_t and x_t = ln X_t to obtain
y_t = α + βx_t + u_t
Linear and Non-linear Models
This is known as the exponential regression model. Here, the coefficient β
can be interpreted as an elasticity.
Similarly, if theory suggests that y and x should be inversely related:
y_t = α + β/x_t + u_t
then the regression can be estimated using OLS by substituting
z_t = 1/x_t
But some models are intrinsically non-linear, e.g.
y_t = α + βx_t^γ + u_t
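A minimal Stata sketch of estimating both transformed models, using the built-in auto dataset as a stand-in (the choice of variables is purely illustrative):

* Estimating transformed models by OLS (auto data as a stand-in)
sysuse auto, clear
gen lnprice = ln(price)
gen lnweight = ln(weight)
regress lnprice lnweight     // log-log model: the slope is an elasticity
gen z = 1/mpg                // substitute z_t = 1/x_t
regress price z              // inverse model: y = alpha + beta*(1/x) + u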
Estimator or Estimate?
Estimators are the formulae used to calculate the coefficients
Estimates are the actual numerical values for the coefficients.
The Assumptions Underlying the Classical Linear
Regression Model (CLRM)
The model which we have used is known as the classical linear regression model.
We observe data for xt, but since yt also depends on ut, we must be specific about
how the ut are generated.
We usually make the following set of assumptions about the ut’s (the
unobservable error terms):
Technical Notation               Interpretation
1. E(u_t) = 0                    The errors have zero mean
2. Var(u_t) = σ²                 The variance of the errors is constant and
                                 finite over all values of x_t
3. Cov(u_i, u_j) = 0, i ≠ j      The errors are statistically independent of
                                 one another
4. Cov(u_t, x_t) = 0             No relationship between the error and the
                                 corresponding x variate
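Although the assumptions concern the unobservable errors, the residuals give informal evidence on them. A minimal sketch using standard Stata post-estimation commands (auto dataset as a stand-in; these checks are not part of the slides themselves):

* Informal checks on the residuals (auto data as a stand-in)
sysuse auto, clear
quietly regress price mpg
predict uhat, residuals
summarize uhat           // residual mean is zero by construction (intercept included)
estat hettest            // Breusch-Pagan test bearing on assumption 2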
The Assumptions Underlying the CLRM Again
An alternative assumption to 4., which is slightly stronger, is that the
x_t's are non-stochastic or fixed in repeated samples.
A fifth assumption is required if we want to make inferences about the
population parameters (the actual α and β) from the sample parameters
(α̂ and β̂).
Additional Assumption
5. u_t is normally distributed
Properties of the OLS Estimator
If assumptions 1. through 4. hold, then the estimators α̂ and β̂ determined by
OLS are known as Best Linear Unbiased Estimators (BLUE).
What does the acronym stand for?
"Estimator": β̂ is an estimator of the true value of β.
"Linear": β̂ is a linear estimator.
"Unbiased": on average, the actual values of α̂ and β̂ will be equal to
the true values.
"Best": the OLS estimator β̂ has minimum variance among
the class of linear unbiased estimators.
Consistency/Unbiasedness/Efficiency
Consistency
The least squares estimators α̂ and β̂ are consistent. That is, the estimates will
converge to their true values as the sample size increases to infinity. We need the
assumptions E(x_t u_t) = 0 and Var(u_t) = σ² < ∞ to prove this. Consistency implies that
lim_{T→∞} Pr( |β̂ − β| > δ ) = 0 for any δ > 0
Unbiasedness
The least squares estimates of α̂ and β̂ are unbiased. That is, E(α̂) = α and E(β̂) = β.
Thus, on average, the estimated values will be equal to the true values. To prove
this also requires the assumption that E(u_t) = 0. Unbiasedness is a stronger
condition than consistency.
Efficiency
An estimator β̂ of the parameter β is said to be efficient if it is unbiased and no other
unbiased estimator has a smaller variance. If the estimator is efficient, we are
minimising the probability that it is a long way off from the true value of β.
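Unbiasedness can be illustrated by simulation: under an assumed data generating process y = 1 + 0.5x + u (the numbers here are illustrative, not from the slides), the average slope estimate over many replications should be close to the true value 0.5. A minimal Stata sketch:

* Monte Carlo sketch of unbiasedness (assumed DGP: y = 1 + 0.5x + u)
clear all
set seed 12345
program define olssim, rclass
    drop _all
    set obs 50
    gen x = rnormal()
    gen y = 1 + 0.5*x + rnormal()   // illustrative true values: alpha = 1, beta = 0.5
    regress y x
    return scalar b = _b[x]
end
simulate b = r(b), reps(1000) nodots: olssim
summarize b                         // mean of b should be close to the true 0.5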
Example: How to Calculate the Parameters and Standard Errors
Assume we have the following data calculated from a regression of y on a
single variable x and a constant over 22 observations.
Data: Σ x_t y_t = 830102, T = 22, x̄ = 416.5, ȳ = 86.65,
Σ x_t² = 3919654, RSS = 130.6
Calculations:
β̂ = (830102 − 22 × 416.5 × 86.65) / (3919654 − 22 × 416.5²) = 0.35
α̂ = 86.65 − 0.35 × 416.5 = −59.12
We write ŷ_t = α̂ + β̂x_t, i.e.
ŷ_t = −59.12 + 0.35x_t
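These calculations can be verified directly with Stata's display calculator (all numbers are taken from the slide above):

* Arithmetic check of the estimates above
display (830102 - 22*416.5*86.65) / (3919654 - 22*416.5^2)   // beta-hat: approx 0.35
display 86.65 - 0.35*416.5                                    // alpha-hat: approx -59.12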
Example (cont'd)
SE of the regression: s = √( Σ û_t² / (T − 2) ) = √(130.6 / 20) = 2.55
SE(α̂) = 2.55 × √( 3919654 / (22 × (3919654 − 22 × 416.5²)) ) = 3.35
SE(β̂) = 2.55 × √( 1 / (3919654 − 22 × 416.5²) ) = 0.0079
We now write the results as
ŷ_t = −59.12 + 0.35x_t
      (3.35)   (0.0079)
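Again, the arithmetic can be verified with display (numbers from the slide above):

* Arithmetic check of the standard errors above
display sqrt(130.6/20)                                       // s: approx 2.55
display 2.55*sqrt(3919654/(22*(3919654 - 22*416.5^2)))       // SE(alpha-hat): approx 3.35
display 2.55*sqrt(1/(3919654 - 22*416.5^2))                  // SE(beta-hat): approx 0.0079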
An Introduction to Statistical Inference
We want to make inferences about the likely population values from the
regression parameters.
Example: Suppose we have the following regression results:
ŷ_t = 20.3 + 0.5091x_t
      (14.38)  (0.2561)
β̂ = 0.5091 is a single (point) estimate of the unknown population
parameter β. How "reliable" is this estimate?
The reliability of the point estimate is measured by the coefficient's
standard error.
Hypothesis Testing: Some Concepts
We can use the information in the sample to make inferences about the
population.
We will always have two hypotheses that go together, the null hypothesis
(denoted H0) and the alternative hypothesis (denoted H1).
The null hypothesis is the statement or the statistical hypothesis that is actually
being tested. The alternative hypothesis represents the remaining outcomes of
interest.
For example, suppose that, given the regression results above, we are interested in
the hypothesis that the true value of β is in fact 0.5. We would use the notation
H0: β = 0.5
H1: β ≠ 0.5
This would be known as a two-sided test.
One-Sided Hypothesis Tests
Sometimes we may have some prior information that, for example, we
would expect β > 0.5 rather than β < 0.5. In this case, we would do a
one-sided test:
H0: β = 0.5
H1: β > 0.5
or we could have had
H0: β = 0.5
H1: β < 0.5
There are two ways to conduct a hypothesis test: via the test of
significance approach or via the confidence interval approach.
The Probability Distribution of the Least Squares
Estimators
We assume that u_t ~ N(0, σ²).
The least squares estimators are linear combinations of the random
variables, i.e. β̂ = Σ w_t y_t,
and the weighted sum of normal random variables is also normally
distributed, so
α̂ ~ N(α, Var(α̂))
β̂ ~ N(β, Var(β̂))
What if the errors are not normally distributed? Will the parameter estimates
still be normally distributed?
Yes, approximately, if the other assumptions of the CLRM hold and the
sample size is sufficiently large.
The Probability Distribution of the Least Squares
Estimators (cont’d)
Standard normal variates can be constructed from α̂ and β̂:
(α̂ − α)/√Var(α̂) ~ N(0, 1)  and  (β̂ − β)/√Var(β̂) ~ N(0, 1)
But Var(α̂) and Var(β̂) are unknown, so replacing them with their sample
estimates gives
(α̂ − α)/SE(α̂) ~ t_{T−2}  and  (β̂ − β)/SE(β̂) ~ t_{T−2}
Testing Hypotheses: The Test of Significance
Approach
Assume the regression equation is given by
y_t = α + βx_t + u_t, for t = 1, 2, ..., T
The steps involved in doing a test of significance are:
1. Estimate α̂, β̂ and SE(α̂), SE(β̂) in the usual way.
2. Calculate the test statistic. This is given by the formula
test statistic = (β̂ − β*) / SE(β̂)
where β* is the value of β under the null hypothesis.
The Test of Significance Approach (cont’d)
We need some tabulated distribution with which to
compare the estimated test statistics. Test statistics derived
in this way can be shown to follow a t-distribution with
T − 2 degrees of freedom.
As the number of degrees of freedom increases, we need
to be less cautious in our approach since we can be more
sure that our results are robust.
The Test of Significance Approach (cont’d)
We need to choose a "significance level", often denoted α
(not to be confused with the regression intercept). This is
also sometimes called the size of the test, and it determines
the region where we will reject or do not reject the null
hypothesis that we are testing.
The intuitive explanation of a 5% significance level is that
we would expect a result as extreme as this, or more
extreme, only 5% of the time as a consequence of chance
alone. It is conventional to use a 5% size of test, but 10%
and 1% are also commonly used.
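A minimal Stata sketch of the two steps (auto dataset as a stand-in; the hypothesised value 0.5 is purely illustrative):

* Test of significance in Stata (auto data as a stand-in)
sysuse auto, clear
regress price mpg
display (_b[mpg] - 0.5) / _se[mpg]       // step 2: test statistic for H0: beta = 0.5
display invttail(e(df_r), 0.025)         // 5% two-sided critical value
test mpg = 0.5                           // built-in equivalent: reports F = t^2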
The Confidence Interval Approach to Hypothesis
Testing
An example of its usage: we estimate a parameter,
say β, to be 0.93, and a "95% confidence interval" to be
(0.77, 1.09). This means that we are 95% confident that
this interval contains the true (but unknown) value
of β.
Confidence intervals are almost invariably two-sided,
although in theory a one-sided interval can be
constructed.
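regress reports these bounds directly, but they can also be computed by hand. A minimal sketch (auto dataset as a stand-in):

* 95% confidence interval for the slope (auto data as a stand-in)
sysuse auto, clear
quietly regress price mpg
display _b[mpg] - invttail(e(df_r), 0.025)*_se[mpg]   // lower bound
display _b[mpg] + invttail(e(df_r), 0.025)*_se[mpg]   // upper bound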
Some More Terminology
If we reject the null hypothesis at the 5% level, we say
that the result of the test is statistically significant.
The t-ratio: An Example
Suppose that we have the following parameter estimates, standard errors
and t-ratios for an intercept and slope respectively:
               Intercept    Slope
Coefficient      1.10       −4.40
SE               1.35        0.96
t-ratio          0.81       −4.63
Compare these with t_crit with 15 − 3 = 12 d.f.
(2.5% in each tail for a 5% test): t_crit = 2.179 at 5%
                                          = 3.055 at 1%
Do we reject H0: β₁ = 0? (No)
             H0: β₂ = 0? (Yes)
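The critical values quoted above can be reproduced with Stata's inverse-t function invttail:

* Critical values for 12 d.f. (2.5% and 0.5% in each tail)
display invttail(12, 0.025)      // approx 2.179 (5% two-sided test)
display invttail(12, 0.005)      // approx 3.055 (1% two-sided test)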
What Does the t-ratio tell us?
If we reject H0, we say that the result is significant. If the coefficient is not
"significant" (e.g. the intercept coefficient in the last regression above), then
it means that the variable is not helping to explain variations in y.
Variables that are not significant are usually removed from the regression
model.
In practice there are good statistical reasons for always including a constant,
even if it is not significant. Look at what happens if no intercept is included:
[Diagram: scatter of y_t against x_t where forcing the fitted line through the
origin gives a poor fit and a distorted slope estimate]