Econometrics 1 (6012B0374Y)
dr. Artūras Juodis
University of Amsterdam
Week 2. Lecture 1
February, 2024
1 / 54
Overview
Multiple regression model
1. Multiple regression setting
   1.1 Motivation
   1.2 The model
   1.3 Empirical results
2. Linear model and OLS using matrix notation
   2.1 Multiple regression model
   2.2 OLS estimator
3. Geometry of OLS
   3.1 Fitted values and residuals
   3.2 Projection matrices
   3.3 The R²
4. Summary
2 / 54
The plan for this week
- We motivate the use of the multiple regression model.
- We introduce matrix and vector notation in this context.
- We use matrix notation to derive and study the OLS estimator.
- We provide a geometrical interpretation of the OLS estimator.
- We prove the Gauss-Markov theorem (Friday).
- We prove the Frisch-Waugh-Lovell theorem (Friday).
3 / 54
Recap: Linear model
Last week, we considered the simple linear model with a single regressor xi
yi = α + βxi + εi , i = 1, . . . , n. (1)
We used this model to understand the determinants of hotel prices in
Vienna. Unfortunately, empirical results for two different choices of xi were
not sufficiently convincing. This motivates the need to study models with
multiple regressors.
4 / 54
Recap: OLS estimator
We used sample data {(yi , xi )}ni=1 to construct statistics that can be used
as estimates of (α, β). For this purpose, we considered the Ordinary Least
Squares (OLS) objective function.
The OLS estimators:
β̂ = Σ_i (xi − x̄)(yi − ȳ) / Σ_i (xi − x̄)²,   (2)
and
α̂ = ȳ − x̄ β̂.   (3)
Today, we show how to derive expressions for (α̂, β̂) using unified
matrix/vector notation. We will do it for the general setting with multiple
regressors.
5 / 54
1. Multiple regression setting
6 / 54
1.1. Motivation
7 / 54
Scatter plot
Figure: Scatter plot of hotel price against distance_km.
8 / 54
Two models we considered last week
In order to explain/fit these patterns we considered two possible linear
models with a single regressor:
pricei = α + βdistancei + εi , or (4)
pricei = α + βDi + εi . (5)
Problem: Both of these models were able to explain some of the features in
the above scatter plot, e.g. that the most expensive hotels are the ones
closer to the city center. However, neither of the models was able to
explain/predict/fit what happens for hotels that are far away from the
center.
Can we do better? YES!
9 / 54
Combination of the two
If we believe that distance is more important for hotels closer to the city
center than for those that are outside of the city center, then it is natural to
consider models that combine features from the separate individual models.
Two natural extensions:
pricei = α + β1 distancei + β2 Di + εi,   (6)
pricei = α + β1 distancei + β2 Di + β3 (distancei × Di) + εi,   (7)
where interactioni ≡ distancei × Di.
While the first model simply adds two regressors linearly to the model, the
second one allows for the interaction effect between the distance variable
and whether the hotel is within 2km or not.
Models of the second type are very common in empirical work. More about
that in Week 5.
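As an illustration (not from the slides), the sketch below shows how the dummy Di and the interaction regressor in (6)-(7) could be constructed with NumPy; the distance values used here are purely hypothetical.

```python
import numpy as np

# Hypothetical distances (km) for a few hotels; values are illustrative only.
distance = np.array([2.737, 2.254, 1.932, 1.449, 0.5])

D = (distance < 2.0).astype(float)   # dummy: 1 if within 2 km of the center
interaction = distance * D           # interaction term distance_i * D_i

# Design matrix for model (7): intercept, distance, D, distance*D
X = np.column_stack([np.ones_like(distance), distance, D, interaction])
print(X)
```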
10 / 54
1.2. The model
11 / 54
Multiple regression model
In what follows we consider the linear regression model with K regressors:
yi = Σ_{k=1}^{K} βk xk,i + εi,   i = 1, . . . , n.   (8)
Here we slightly deviate from the convention used before (for a reason that
will become obvious soon) and include the intercept as the first regressor,
i.e. α = β1 and x1,i = 1, i = 1, . . . , n.
For the results that follow, the distinction between regressors that vary
across units (e.g. distance) and those that do not (the intercept) is immaterial.
12 / 54
Interpretation
Under (a modified version of) the classical assumptions, the coefficient βk can
be interpreted as the partial/marginal effect:
∂E[yi | (x1,i, . . . , xK,i)] / ∂xk,i = βk,   k = 2, . . . , K,   (9)
provided all regressors are continuous. Hence, βk measures the effect of a
marginal change in xk,i on the conditional expectation of yi (given all regressors).
13 / 54
Interpretation. Hotels example.
In this model (using the new notation)
pricei = β1 + β2 distancei + β3 Di + β4 (distancei × Di) + εi,   (10)
with interactioni ≡ distancei × Di, the coefficients do not have a direct
marginal-effect interpretation because of the interaction term. Instead:
∂E[pricei | distancei] / ∂distancei = β2 + β4,   if distancei < 2km,
∂E[pricei | distancei] / ∂distancei = β2,        if distancei ≥ 2km.   (11)
14 / 54
Not all regressors are equal!
Note that while I use the same notation (x1,i, . . . , xK,i) for all regressors (so
they are all mathematically equal), in reality some regressors are of greater
interest to economists than others.
We would usually split all regressors into:
- Policy/treatment/primary variables, e.g. variables that contain
  information about the exposure of units to treatments/policy changes,
  etc.
- Control variables, e.g. demographic characteristics of units, and the intercept.
This distinction is fairly new in econometrics, and is mostly driven by the
credibility revolution, where more and more studies try to evaluate the effects
of policy interventions.
In that case, the effects of variables that can actually be
manipulated/intervened upon are more important than those of control
variables that policy makers cannot change.
15 / 54
1.3. Empirical results
16 / 54
Empirical results. Model 1.
Figure: Regression of Price on distancei and dummy variable
Di = 1(distancei < 2km).
17 / 54
Empirical results. Model 2.
Figure: Regression of Price on distancei, dummy variable Di = 1(distancei < 2km),
and their interaction, interactioni.
18 / 54
How should we interpret Model 2? Fitted curves
Translating regression output, we can consider the corresponding fitted lines:
ŷ(distance) = 172.61 − 43.10 × distance   if distance < 2km,
ŷ(distance) = 88.72 + 0.72 × distance     if distance ≥ 2km.   (12)
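For example, plugging two distances into (12): a hotel 1 km from the center has predicted price ŷ(1) = 172.61 − 43.10 × 1 = 129.51, while a hotel 5 km away has ŷ(5) = 88.72 + 0.72 × 5 = 92.32.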
19 / 54
Conclusions?
On Model 1.
- Just adding the two regressors additively does not improve the fit of the
  model substantially, i.e. R² goes up only marginally. Later this week we
  explain why R² should always go up.
- From Model 1 it is clear that distance itself does not matter that much for
  the variation in prices. What matters more is whether the distance is <2km
  or not. So the relationship is not linear after all.
On Model 2.
- Adding the interaction term dramatically improves R². Hence, distance
  matters!
- But it is mostly important (and can be explained by the model) only if
  you are within the 2km radius from the city center.
- For observations outside the 2km radius from the city center, the effect
  of distance is even positive!
- This illustrates why models with interacted explanatory variables are so
  popular among applied econometricians and economists.
20 / 54
2. Linear model and OLS using matrix notation
21 / 54
2.1. Multiple regression model
22 / 54
How is OLS calculated with multiple regressors?
OLS coefficients in the multiple regression setting are obtained identically to
the setting with one regressor, i.e. by minimizing the Least Squares
objective function (or, equivalently, the sum of squared errors):
(β̂1, . . . , β̂K) = arg min_{β1,...,βK} Σ_{i=1}^{n} ( yi − Σ_{k=1}^{K} βk xk,i )².   (13)
Or, as in the first lecture:
(β̂1, . . . , β̂K) = arg min_{β1,...,βK} LSn(β1, . . . , βK).   (14)
23 / 54
Derivatives
We look at the first partial derivatives of that objective function:
∂LSn(β1, . . . , βK) / ∂βk = −2 Σ_{i=1}^{n} xk,i ( yi − Σ_{k=1}^{K} βk xk,i ),
for all k = 1, . . . , K. Hence, the minimizer of the objective function
(β̂1, . . . , β̂K) should be a zero of the above set of equations (K equations
in K unknowns), i.e.:
Σ_{i=1}^{n} xk,i ( yi − Σ_{k=1}^{K} β̂k xk,i ) = 0,   (15)
for all k = 1, . . . , K. Hence, we have a set of K equations in K unknowns!
24 / 54
Not convenient
While the previous equations are correct, they are generally inconvenient to
work with. Given your knowledge of (Advanced) Linear Algebra, it is much
more convenient to derive all results using appropriate vector/matrix
notation and the language of systems of linear equations.
Note that for any [S × 1] vectors a = (a1, . . . , aS)' and b = (b1, . . . , bS)':
a'b = b'a = Σ_{s=1}^{S} as bs.   (16)
Here I use the convention that ' denotes the transpose of a vector, and all
bold quantities are column vectors.
25 / 54
Matrix preliminaries for OLS
Let us introduce the following notation:
y = (y1, . . . , yn)',           [n × 1]
ε = (ε1, . . . , εn)',           [n × 1]
xi = (x1,i, . . . , xK,i)',      [K × 1]
x(k) = (xk,1, . . . , xk,n)',    [n × 1]
X = (x1, . . . , xn)',           [n × K]
β = (β1, . . . , βK)',           [K × 1].
With this notation the linear model just reads as:
yi = xi'β + εi,   (17)
for all i = 1, . . . , n.
26 / 54
Matrix preliminaries for OLS
We can also collect all these n individual models into a system of n
equations:
y1 = x1'β + ε1,
y2 = x2'β + ε2,
. . .
yn−1 = xn−1'β + εn−1,
yn = xn'β + εn.
Or simply as:
y = X β + ε. (18)
27 / 54
Example. Vienna Hotels. Model with interaction.
The first five rows of y and X are given by (without any specific sorting):
     [  81 ]        [ 1   2.737   0   0     ]
     [  85 ]        [ 1   2.254   0   0     ]
y =  [  83 ]  , X = [ 1   2.737   0   0     ]   (19)
     [  82 ]        [ 1   1.932   1   1.932 ]
     [ 103 ]        [ 1   1.449   1   1.449 ]
     [  ⋮  ]        [ ⋮     ⋮     ⋮     ⋮   ]
Here the first column of X is the vector of ones (intercept), the second
column is the distance in kilometres, the third column is a binary variable
that indicates whether the hotel is <2km from the city center, and the final
column is the product of the latter two.
28 / 54
2.2. OLS estimator
29 / 54
OLS using matrix notation
Using matrix notation:
LSn(β1, . . . , βK) = LSn(β) = Σ_{i=1}^{n} (yi − xi'β)² = (y − Xβ)'(y − Xβ).   (20)
Note that for any value of β the LSn (β) objective function is a scalar!
Given that β is a [K × 1] vector, the derivative of LSn (β) with respect to β
is a [K × 1] vector.
30 / 54
Derivatives
We showed that derivatives are given by:
∂LSn(β1, . . . , βK) / ∂βk = −2 Σ_{i=1}^{n} xk,i ( yi − Σ_{k=1}^{K} βk xk,i ).
Or alternatively, using our new notation:
∂LSn(β) / ∂βk = −2 Σ_{i=1}^{n} xk,i (yi − xi'β) = −2 (x(k))'(y − Xβ).
Collecting all such K equations together:
∂LSn(β) / ∂β = −2 X'(y − Xβ).
31 / 54
The OLS estimator
From the above we conclude that the OLS estimator β̂ is the solution to the
following set of K equations in K unknowns:
X'(y − Xβ̂) = 0K.   (21)
If X'X is of full rank K, i.e. rank(X'X) = K, then the above system of
equations has a unique solution:
β̂ = (X'X)^{-1}(X'y) = ( Σ_{i=1}^{n} xi xi' )^{-1} ( Σ_{i=1}^{n} xi yi ).   (22)
Here (·)^{-1} denotes the usual matrix inverse.
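A minimal numerical sketch of (22), using simulated data (the parameter values below are arbitrary, not the Vienna data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# OLS: beta_hat = (X'X)^{-1} X'y; solve() is preferred to forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```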
32 / 54
The OLS estimator. Special case
For the special case we analyzed in the previous week, xi = (1, xi)' and
β̂ = (α̂, β̂)', so that:
(α̂, β̂)' = [ Σ_{i=1}^{n} 1     Σ_{i=1}^{n} xi  ]^{-1} [ Σ_{i=1}^{n} yi    ]
           [ Σ_{i=1}^{n} xi    Σ_{i=1}^{n} xi² ]       [ Σ_{i=1}^{n} xi yi ].   (23)
We arrive at the expressions derived previously in the course by using the
exact formula for the inverse of a [2 × 2] matrix.
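To make the link to (2)-(3) explicit, here is a brief intermediate step (not shown on the slide) using the exact inverse of the [2 × 2] matrix:

```latex
\[
\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix}
= \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
  \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}
  \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix},
\qquad
\hat{\beta} = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
            = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\]
```

and α̂ = ȳ − x̄β̂ then follows from the first row.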
33 / 54
3. Geometry of OLS
34 / 54
3.1. Fitted values and residuals
35 / 54
LS objective function decomposition
Observe that:
LSn(β) = (y − Xβ̂ − X(β − β̂))'(y − Xβ̂ − X(β − β̂))
       = (y − Xβ̂)'(y − Xβ̂) + (β − β̂)'X'X(β − β̂)
         − (y − Xβ̂)'X(β − β̂) − (β − β̂)'X'(y − Xβ̂).
36 / 54
LS objective function decomposition
Observe that:
(y − Xβ̂)'X = (y − X(X'X)^{-1}X'y)'X = y'X − y'X = 0K'.   (24)
Hence:
LSn(β) = LSn(β̂) + (β − β̂)'X'X(β − β̂) ≥ LSn(β̂).   (25)
Why does the above inequality hold? Observe that (β − β̂)'X'X(β − β̂) is a
quadratic form. Hence, it is non-negative by construction.
Conclusion? The OLS estimator β̂ is indeed a minimizer of the objective
function LSn(β).
37 / 54
Decomposition
Consider the decomposition (using vector notation) of y into the explained/fitted
part and the residual:
y = ŷ + ê.   (26)
Note that:
ŷ = Xβ̂ = X(X'X)^{-1}X'y,   (27)
ê = y − ŷ = (In − X(X'X)^{-1}X')y.   (28)
Hence both the fitted values and the residuals are certain (linear)
transformations of the original data y .
38 / 54
3.2. Projection matrices
39 / 54
Decomposition
Let:
PX = X(X'X)^{-1}X',
MX = In − X(X'X)^{-1}X'.
Then:
MX + PX = In,   (29)
and also:
MX PX = On×n.   (30)
These two matrices (MX and PX) are very special and are known as
projection matrices. MX is also known as the residual-maker matrix, for an
obvious reason.
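A quick numerical check of (29)-(30) plus idempotency and symmetry (a sketch with an arbitrary simulated X; not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

P = X @ np.linalg.solve(X.T @ X, X.T)   # P_X = X (X'X)^{-1} X'
M = np.eye(n) - P                       # M_X = I_n - P_X

print(np.allclose(P + M, np.eye(n)))                 # M_X + P_X = I_n
print(np.allclose(M @ P, 0))                          # M_X P_X = O
print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
print(np.allclose(P, P.T), np.allclose(M, M.T))       # symmetric
```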
40 / 54
MX and PX are projection matrices
General definition. A matrix V is an orthogonal projection matrix if:
V = V² = V'.   (31)
It is easy to see that PX is indeed a projection matrix. Next, consider MX:
MX MX = (In − X(X'X)^{-1}X')MX = MX − PX MX = MX,   (32)
and obviously MX = MX'. Hence MX is also an orthogonal projection
matrix.
41 / 54
Projection matrix PX
What exactly do these matrices project onto?
PX is a projection matrix onto the space spanned by the columns of X (K of
them). In particular, take any vector z = Xγ (hence z is in the span of X);
then:
PX z = X(X'X)^{-1}X'Xγ = Xγ = z.   (33)
This means that if you project something that already lies in the span of X,
nothing changes.
42 / 54
Projection matrix MX
Note that dim(X) = [n × K], hence if rank(X) = K, then the dimension of
the corresponding null space is n − K.
Indeed, MX projects off the space spanned by the columns of X. In particular,
take any vector z = Xγ (hence z is in the span of X); then:
MX z = Xγ − X(X'X)^{-1}X'Xγ = Xγ − Xγ = 0n.   (34)
43 / 54
OLS geometrically
Hence, geometrically, OLS simply projects y onto two spaces that are
orthogonal to each other:
- ŷ, the fitted values, which lie in the K-dimensional space spanned by the
  columns of X;
- ê, the residuals, which lie in the corresponding orthogonal complement.
From this decomposition it is not surprising that:
ŷ'ê = y'PX MX y = 0.   (35)
44 / 54
Implication. Projection matrices.
One of the most obvious implications of MX X = O is that the residuals ê
sum to 0, i.e.:
Σ_{i=1}^{n} êi = ın'ê = 0.   (36)
Here ın = (1, . . . , 1)' is an [n × 1] vector of ones. In particular,
this result follows from the fact that:
ın = X e1,   (37)
where e1 = (1, 0, . . . , 0)' is a [K × 1] selection vector (i.e. the vector that
selects the first column of X). Hence:
ın'ê = e1'X'MX y = e1'X'MX'y = e1'(MX X)'y = 0.   (38)
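A short numerical check of (35)-(36) (simulated data with an intercept in the first column; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat                          # residuals e_hat = M_X y

print(np.isclose(e_hat.sum(), 0.0))               # residuals sum to zero (36)
print(np.isclose((X @ beta_hat) @ e_hat, 0.0))    # fitted values orthogonal to residuals (35)
```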
45 / 54
3.3. The R²
46 / 54
Some preliminaries
Consider the decomposition
SST = SSE + SSR. (39)
Recall that we defined R² as a function of the SSE (Explained Sum of
Squares):
R² ≡ SSE/SST = 1 − SSR/SST ∈ [0, 1].   (40)
In what follows we derive (again) the SST = SSE + SSR decomposition
using matrix algebra and some additional projection matrices.
47 / 54
Demeaning projection matrix
In the definition of SSE we considered the demeaned yi, i.e. yi − ȳ.
Consider the stacked version of this, i.e. the [n × 1] vector (using the
vector notation):
ỹ ≡ y − ın ȳ = y − ın ın'y/n = y − ın(ın'ın)^{-1}ın'y.   (41)
Note that the above can be expressed as:
ỹ = M1 y,   (42)
where M1 = In − ın(ın'ın)^{-1}ın' is an orthogonal projection matrix!
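A sketch verifying that M1 y returns the demeaned y (simulated y, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
y = rng.normal(size=n)

iota = np.ones((n, 1))
M1 = np.eye(n) - iota @ iota.T / n        # M_1 = I_n - iota (iota'iota)^{-1} iota'

print(np.allclose(M1 @ y, y - y.mean()))  # M_1 y equals the demeaned y
```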
48 / 54
SST = SSE + SSR decomposition
From y = PX y + MX y we can obtain:
ỹ = M1 PX y + M1 MX y.   (43)
This looks complicated! However, notice that:
M1 MX = MX − ın(ın'ın)^{-1}ın'MX,   (44)
but we showed previously that ın'MX = 0n'. Hence:
M1 MX = MX.   (45)
49 / 54
SST = SSE + SSR decomposition
Using the above result:
ỹ = M1 PX y + MX y.   (46)
Such that:
ỹ'ỹ = y'PX'M1'M1 PX y + y'MX'MX y.   (47)
Here we used the fact that, because MX = MX' = MX² (and the same for
M1):
MX'M1 PX = (M1 MX)'PX = MX PX = O.   (48)
50 / 54
The R²
This implies again that:
ỹ'ỹ = ŷ'M1 ŷ + ê'ê,   i.e. SST = SSE + SSR.   (49)
Hence:
R² ≡ SSE/SST = 1 − y'MX y / y'M1 y.   (50)
Here we used the fact that ê'ê = y'MX y and ỹ'ỹ = y'M1 y.
Hence R² is a function of two different quadratic forms in the y vector.
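Finally, a sketch checking (50) against the usual 1 − SSR/SST definition on simulated data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
M_X = np.eye(n) - P
iota = np.ones((n, 1))
M_1 = np.eye(n) - iota @ iota.T / n

r2_quadratic = 1.0 - (y @ M_X @ y) / (y @ M_1 @ y)   # R^2 = 1 - y'M_X y / y'M_1 y
e_hat = M_X @ y
r2_classic = 1.0 - (e_hat @ e_hat) / np.sum((y - y.mean()) ** 2)

print(np.isclose(r2_quadratic, r2_classic))
```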
51 / 54
4. Summary
52 / 54
Summary today
In this lecture
- We introduced the multiple regression framework.
- We introduced the vector/matrix notation for this framework.
- We showed how the OLS estimator can be derived using this new
  notation.
- We provided a geometrical interpretation of the OLS estimator.
- We interpreted the residuals and fitted values in terms of the
  corresponding orthogonal projections.
53 / 54
On Friday
- We study the statistical properties of the OLS estimator.
- We prove the Gauss-Markov theorem, which implies that OLS is the
  BLUE estimator.
- We prove the Frisch-Waugh-Lovell theorem.
54 / 54