Chapter 3
Multiple Regression Analysis: Estimation
In Chapter 2, we learned how to use simple regression analysis to explain a dependent variable, y, as a function of a single independent variable, x. The primary drawback in using simple regression analysis for empirical work is that it is very difficult to draw ceteris paribus conclusions about how x affects y: the key assumption, SLR.4—that all other factors affecting y are uncorrelated with x—is often unrealistic.
Multiple regression analysis is more amenable to ceteris paribus analysis because
it allows us to explicitly control for many other factors that simultaneously affect the
dependent variable. This is important both for testing economic theories and for evaluat-
ing policy effects when we must rely on nonexperimental data. Because multiple regres-
sion models can accommodate many explanatory variables that may be correlated, we can
hope to infer causality in cases where simple regression analysis would be misleading.
Naturally, if we add more factors to our model that are useful for explaining y, then
more of the variation in y can be explained. Thus, multiple regression analysis can be used
to build better models for predicting the dependent variable.
An additional advantage of multiple regression analysis is that it can incorporate fairly
general functional form relationships. In the simple regression model, only one function
of a single explanatory variable can appear in the equation. As we will see, the multiple
regression model allows for much more flexibility.
Section 3.1 formally introduces the multiple regression model and further discusses
the advantages of multiple regression over simple regression. In Section 3.2, we demon-
strate how to estimate the parameters in the multiple regression model using the method of
ordinary least squares. In Sections 3.3, 3.4, and 3.5, we describe various statistical proper-
ties of the OLS estimators, including unbiasedness and efficiency.
The multiple regression model is still the most widely used vehicle for empirical
analysis in economics and other social sciences. Likewise, the method of ordinary least
squares is popularly used for estimating the parameters of the multiple regression model.
where
b0 is the intercept.
b1 is the parameter associated with x1.
b2 is the parameter associated with x2, and so on.
Since there are k independent variables and an intercept, equation (3.6) contains k + 1 (unknown) population parameters. For shorthand purposes, we will sometimes refer to the parameters other than the intercept as slope parameters, even though this is not always literally what they are. [See equation (3.4), where neither b1 nor b2 is itself a slope, but together they determine the slope of the relationship between consumption and income.]
The terminology for multiple regression is similar to that for simple regression and
is given in Table 3.1. Just as in simple regression, the variable u is the error term or
disturbance. It contains factors other than x1, x2, …, xk that affect y. No matter how many
explanatory variables we include in our model, there will always be factors we cannot
include, and these are collectively contained in u.
When applying the general multiple regression model, we must know how to interpret
the parameters. We will get plenty of practice now and in subsequent chapters, but it is
useful at this point to be reminded of some things we already know. Suppose that CEO
salary (salary) is related to firm sales (sales) and CEO tenure (ceoten) with the firm by

log(salary) = b0 + b1 log(sales) + b2 ceoten + b3 ceoten² + u.    [3.7]
This fits into the multiple regression model (with k = 3) by defining y = log(salary), x1 = log(sales), x2 = ceoten, and x3 = ceoten². As we know from Chapter 2, the parameter b1 is the (ceteris paribus) elasticity of salary with respect to sales. If b3 = 0, then 100b2 is approximately the ceteris paribus percentage increase in salary when ceoten increases by one year. When b3 ≠ 0, the effect of ceoten on salary is more complicated. We will postpone a detailed treatment of general models with quadratics until Chapter 6.
Equation (3.7) provides an important reminder about multiple regression analysis.
The term “linear” in a multiple linear regression model means that equation (3.6) is linear
in the parameters, bj. Equation (3.7) is an example of a multiple regression model that,
while linear in the bj, is a nonlinear relationship between salary and the variables sales
and ceoten. Many applications of multiple linear regression involve nonlinear relation-
ships among the underlying variables.
The key assumption for the general multiple regression model is easy to state in terms
of a conditional expectation:
E(u | x1, x2, …, xk) = 0.    [3.8]
At a minimum, equation (3.8) requires that all factors in the unobserved error term be
uncorrelated with the explanatory variables. It also means that we have correctly accounted
for the functional relationships between the explained and explanatory variables. Any
problem that causes u to be correlated with any of the independent variables causes (3.8) to
fail. In Section 3.3, we will show that assumption (3.8) implies that OLS is unbiased and
will derive the bias that arises when a key variable has been omitted from the equation. In
Chapters 15 and 16, we will study other reasons that might cause (3.8) to fail and show
what can be done in cases where it does fail.
where
b̂0 is the estimate of b0.
b̂1 is the estimate of b1.
b̂2 is the estimate of b2.
But how do we obtain b̂0, b̂1, and b̂2? The method of ordinary least squares chooses the estimates to minimize the sum of squared residuals. That is, given n observations on y, x1, and x2, {(xi1, xi2, yi): i = 1, 2, …, n}, the estimates b̂0, b̂1, and b̂2 are chosen simultaneously to make

∑_{i=1}^{n} (yi − b̂0 − b̂1xi1 − b̂2xi2)²    [3.10]

as small as possible.
To understand what OLS is doing, it is important to master the meaning of the index-
ing of the independent variables in (3.10). The independent variables have two subscripts
here, i followed by either 1 or 2. The i subscript refers to the observation number. Thus,
the sum in (3.10) is over all i 5 1 to n observations. The second index is simply a method
of distinguishing between different independent variables. In the example relating wage
to educ and exper, xi1 = educi is education for person i in the sample, and xi2 = experi is experience for person i. The sum of squared residuals in equation (3.10) is ∑_{i=1}^{n} (wagei − b̂0 − b̂1educi − b̂2experi)². In what follows, the i subscript is reserved for indexing the
observation number. If we write xij, then this means the ith observation on the jth indepen-
dent variable. (Some authors prefer to switch the order of the observation number and the
variable number, so that x1i is observation i on variable one. But this is just a matter of
notational taste.)
In the general case with k independent variables, we seek estimates b̂0, b̂1, …, b̂k in the equation

ŷ = b̂0 + b̂1x1 + b̂2x2 + … + b̂kxk.    [3.11]

The OLS estimates, k + 1 of them, are chosen to minimize the sum of squared residuals:

∑_{i=1}^{n} (yi − b̂0 − b̂1xi1 − … − b̂kxik)².    [3.12]
This minimization problem can be solved using multivariable calculus (see Appendix 3A).
This leads to k + 1 linear equations in k + 1 unknowns b̂0, b̂1, …, b̂k:

∑_{i=1}^{n} (yi − b̂0 − b̂1xi1 − … − b̂kxik) = 0
∑_{i=1}^{n} xi1(yi − b̂0 − b̂1xi1 − … − b̂kxik) = 0
∑_{i=1}^{n} xi2(yi − b̂0 − b̂1xi1 − … − b̂kxik) = 0    [3.13]
⋮
∑_{i=1}^{n} xik(yi − b̂0 − b̂1xi1 − … − b̂kxik) = 0.
These are often called the OLS first order conditions. As with the simple regression model in Section 2.2, the OLS first order conditions can be obtained by the method of moments.
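The algebra behind (3.13) is easy to check numerically. The following sketch is not from the text; the simulated data and variable names are hypothetical. Stacking the regressors, together with a column of ones for the intercept, into a matrix X turns the k + 1 conditions in (3.13) into the linear system X'X b̂ = X'y, which can be solved directly.

    import numpy as np

    # Hypothetical data: n observations on y and k = 2 regressors.
    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + 0.5 * x1 - 2.0 * x2 + rng.normal(size=n)

    # Stack an intercept column with the regressors: X is n x (k + 1).
    X = np.column_stack([np.ones(n), x1, x2])

    # The first order conditions in (3.13) say X'(y - X b_hat) = 0,
    # i.e. (X'X) b_hat = X'y.  Solve this (k + 1) x (k + 1) linear system.
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # The residuals satisfy the first order conditions: each column of X is
    # orthogonal to the residual vector (up to floating-point error).
    resid = y - X @ b_hat
    print(b_hat)            # estimates of (b0, b1, b2)
    print(X.T @ resid)      # approximately zero, as in (3.13)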
ŷ = b̂0 + b̂1x1 + b̂2x2.    [3.14]

The intercept b̂0 in equation (3.14) is the predicted value of y when x1 = 0 and x2 = 0.
Sometimes, setting x1 and x2 both equal to zero is an interesting scenario; in other cases, it
will not make sense. Nevertheless, the intercept is always needed to obtain a prediction of
y from the OLS regression line, as (3.14) makes clear.
The estimates b̂1 and b̂2 have partial effect, or ceteris paribus, interpretations. From equation (3.14), we have

Δŷ = b̂1Δx1 + b̂2Δx2,

so we can obtain the predicted change in y given the changes in x1 and x2. (Note how the intercept has nothing to do with the changes in y.) In particular, when x2 is held fixed, so that Δx2 = 0, then

Δŷ = b̂1Δx1,

holding x2 fixed. The key point is that, by including x2 in our model, we obtain a coefficient on x1 with a ceteris paribus interpretation. This is why multiple regression analysis is so useful. Similarly,

Δŷ = b̂2Δx2,

holding x1 fixed.
The case with more than two independent variables is similar. The OLS regression line is

ŷ = b̂0 + b̂1x1 + b̂2x2 + … + b̂kxk.    [3.16]

Written in terms of changes,

Δŷ = b̂1Δx1 + b̂2Δx2 + … + b̂kΔxk.    [3.17]
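As a quick illustration of (3.17), the short sketch below uses made-up slope estimates and changes, purely for illustration; note that the intercept plays no role in the predicted change.

    # Hypothetical OLS slope estimates b_hat_1, ..., b_hat_k and changes in x.
    b_hat = [0.45, -0.10, 2.3]          # slopes only; the intercept drops out
    delta_x = [1.0, 5.0, -0.5]          # one-unit rise in x1, etc.

    # Equation (3.17): the predicted change in y is the sum of b_hat_j * delta_x_j.
    delta_y_hat = sum(b * dx for b, dx in zip(b_hat, delta_x))
    print(delta_y_hat)

    # Holding x2 and x3 fixed (delta_x2 = delta_x3 = 0) isolates the partial
    # effect of x1: delta_y_hat = b_hat_1 * delta_x1.
    print(b_hat[0] * 1.0)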
b̂1 = ( ∑_{i=1}^{n} r̂i1 yi ) / ( ∑_{i=1}^{n} r̂i1² ),    [3.22]

where the r̂i1 are the OLS residuals from a simple regression of x1 on x2, using the sample at hand. We regress our first independent variable, x1, on our second independent variable, x2, and then obtain the residuals (y plays no role here). Equation (3.22) shows that we can then do a simple regression of y on r̂1 to obtain b̂1. (Note that the residuals r̂i1 have a zero sample average, and so b̂1 is the usual slope estimate from simple regression.)
The representation in equation (3.22) gives another demonstration of b̂1's partial effect interpretation. The residuals r̂i1 are the part of xi1 that is uncorrelated with xi2. Another way of saying this is that r̂i1 is xi1 after the effects of xi2 have been partialled out, or netted out. Thus, b̂1 measures the sample relationship between y and x1 after x2 has been partialled out.
In simple regression analysis, there is no partialling out of other variables because no other variables are included in the regression. Computer Exercise C5 steps you through the partialling out process using the wage data from Example 3.2. For practical purposes, the important thing is that b̂1 in the equation ŷ = b̂0 + b̂1x1 + b̂2x2 measures the change in y given a one-unit increase in x1, holding x2 fixed.
In the general model with k explanatory variables, b̂1 can still be written as in equation (3.22), but the residuals r̂i1 come from the regression of x1 on x2, …, xk. Thus, b̂1 measures the effect of x1 on y after x2, …, xk have been partialled or netted out.
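The partialling-out interpretation in (3.22) is easy to verify numerically. The sketch below (hypothetical simulated data, plain numpy) regresses x1 on x2, keeps the residuals, and confirms that the simple regression of y on those residuals reproduces the multiple regression coefficient on x1.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000
    x2 = rng.normal(size=n)
    x1 = 0.6 * x2 + rng.normal(size=n)          # x1 and x2 are correlated
    y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

    # Multiple regression of y on x1 and x2 (with intercept).
    X = np.column_stack([np.ones(n), x1, x2])
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Step 1: regress x1 on x2 and keep the residuals r1, the part of x1
    # that is uncorrelated with x2.
    Z = np.column_stack([np.ones(n), x2])
    gamma = np.linalg.solve(Z.T @ Z, Z.T @ x1)
    r1 = x1 - Z @ gamma

    # Step 2: simple regression of y on r1.  Equation (3.22): the slope is
    # sum(r1 * y) / sum(r1**2), and it reproduces the multiple regression
    # coefficient on x1.
    b1_partial = np.sum(r1 * y) / np.sum(r1 ** 2)
    print(b_hat[1], b1_partial)    # the two numbers agree up to rounding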
b̃1 = b̂1 + b̂2 δ̃1,    [3.23]

where δ̃1 is the slope coefficient from the simple regression of xi2 on xi1, i = 1, …, n. This equation shows how b̃1 differs from the partial effect of x1 on ŷ. The confounding term is the partial effect of x2 on ŷ times the slope in the sample regression of x2 on x1. (See Section 3A.4 in the chapter appendix for a more general verification.)
The relationship between b̃1 and b̂1 also shows there are two distinct cases where they are equal:
1. The partial effect of x2 on ŷ is zero in the sample; that is, b̂2 = 0.
2. x1 and x2 are uncorrelated in the sample; that is, δ̃1 = 0.
Even though simple and multiple regression estimates are almost never identical, we can use the above formula to characterize why they might be either very different or quite similar. For example, if b̂2 is small, we might expect the multiple and simple regression estimates of b1 to be similar. In Example 3.1, the sample correlation between hsGPA and ACT is about 0.346, which is a nontrivial correlation. But the coefficient on ACT is fairly small. It is not surprising to find that the simple regression of colGPA on hsGPA produces a slope estimate of .482, which is not much different from the estimate .453 in (3.15).
In the case with k independent variables, the simple regression of y on x1 and the multiple regression of y on x1, x2, …, xk produce an identical estimate of b1 only if (1) the OLS coefficients on x2 through xk are all zero or (2) x1 is uncorrelated with each of x2, …, xk.
Neither of these is very likely in practice. But if the coefficients on x2 through xk are
small, or the sample correlations between x1 and the other independent variables are in-
substantial, then the simple and multiple regression estimates of the effect of x1 on y can
be similar.
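The algebraic link between the simple and multiple regression estimates, b̃1 = b̂1 + b̂2 δ̃1 in equation (3.23), can also be checked numerically. The hypothetical simulation below verifies that the simple regression slope equals the multiple regression slope on x1 plus the coefficient on x2 times the slope from regressing x2 on x1.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 800
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

    def ols(Xcols, y):
        """Return OLS coefficients for a regression with an intercept."""
        X = np.column_stack([np.ones(len(y))] + list(Xcols))
        return np.linalg.solve(X.T @ X, X.T @ y)

    b_tilde = ols([x1], y)                # simple regression of y on x1
    b_hat = ols([x1, x2], y)              # multiple regression of y on x1, x2
    delta_tilde = ols([x1], x2)           # regression of x2 on x1

    # b_tilde_1 = b_hat_1 + b_hat_2 * delta_tilde_1, as in (3.23).
    print(b_tilde[1])
    print(b_hat[1] + b_hat[2] * delta_tilde[1])   # same number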
Goodness-of-Fit
As with simple regression, we can define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares or sum of squared residuals (SSR) as

SST ≡ ∑_{i=1}^{n} (yi − ȳ)²    [3.24]

SSE ≡ ∑_{i=1}^{n} (ŷi − ȳ)²    [3.25]

SSR ≡ ∑_{i=1}^{n} ûi².    [3.26]

Using the same argument as in the simple regression case, we can show that

SST = SSE + SSR.    [3.27]

In other words, the total variation in {yi} is the sum of the total variations in {ŷi} and in {ûi}.
Assuming that the total variation in y is nonzero, as is the case unless yi is constant in the sample, we can divide (3.27) by SST to get

SSR/SST + SSE/SST = 1.
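A minimal sketch (hypothetical data, numpy only) that computes the three sums of squares in (3.24)-(3.26), verifies the decomposition (3.27), and forms the usual R-squared as SSE/SST:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 400
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 0.5 + 1.0 * x1 - 1.0 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ b_hat
    u_hat = y - y_hat

    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares, (3.24)
    sse = np.sum((y_hat - y.mean()) ** 2)    # explained sum of squares, (3.25)
    ssr = np.sum(u_hat ** 2)                 # residual sum of squares, (3.26)

    print(sst, sse + ssr)      # equal up to rounding: SST = SSE + SSR
    print(sse / sst)           # R-squared of the regression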
The fact that R2 never decreases when any variable is added to a regression makes
it a poor tool for deciding whether one variable or several variables should be added to
a model. The factor that should determine whether an explanatory variable belongs in a
model is whether the explanatory variable has a nonzero partial effect on y in the popu-
lation. We will show how to test this hypothesis in Chapter 4 when we cover statisti-
cal inference. We will also see that, when used properly, R2 allows us to test a group of
variables to see if it is important for explaining y. For now, we use it as a goodness-of-fit
measure for a given model.
Example 3.5 deserves a final word of caution. The fact that the four explanatory
variables included in the second regression explain only about 4.2% of the variation in
narr86 does not necessarily mean that the equation is useless. Even though these variables
collectively do not explain much of the variation in arrests, it is still possible that the OLS
estimates are reliable estimates of the ceteris paribus effects of each independent variable
on narr86. As we will see, whether this is the case does not directly depend on the size of
R2. Generally, a low R2 indicates that it is hard to predict individual outcomes on y with
much accuracy, something we study in more detail in Chapter 6. In the arrest example,
the small R2 reflects what we already suspect in the social sciences: it is generally very
difficult to predict individual behavior.
where pcnv is a proxy for the likelihood of being convicted of a crime and avgsen is a
measure of expected severity of punishment, if convicted. The variable ptime86 captures
the incarcerative effects of crime: if an individual is in prison, he cannot be arrested for a
crime outside of prison. Labor market opportunities are crudely captured by qemp86.
First, we estimate the model without the variable avgsen. We obtain
narr86 = .712 − .150 pcnv − .034 ptime86 − .104 qemp86
n = 2,725, R² = .0413.
This equation says that, as a group, the three variables pcnv, ptime86, and qemp86 explain
about 4.1% of the variation in narr86.
Each of the OLS slope coefficients has the anticipated sign. An increase in the
proportion of convictions lowers the predicted number of arrests. If we increase pcnv
by .50 (a large increase in the probability of conviction), then, holding the other factors fixed,
Δnarr86 = −.150(.50) = −.075. This may seem unusual because an arrest cannot change by a fraction. But we can use this value to obtain the predicted change in expected arrests for a large group of men. For example, among 100 men, the predicted fall in arrests when pcnv increases by .50 is 7.5.
Similarly, a longer prison term leads to a lower predicted number of arrests. In fact, if ptime86 increases from 0 to 12, predicted arrests for a particular man fall by .034(12) = .408.
Another quarter in which legal employment is reported lowers predicted arrests by .104,
which would be 10.4 arrests among 100 men.
If avgsen is added to the model, we know that R 2 will increase. The estimated
equation is
narr86 = .707 − .151 pcnv + .0074 avgsen − .037 ptime86 − .103 qemp86
n = 2,725, R² = .0422.
Thus, adding the average sentence variable increases R2 from .0413 to .0422, a practically
small effect. The sign of the coefficient on avgsen is also unexpected: it says that a longer
average sentence length increases criminal activity.
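To reproduce the ceteris paribus arithmetic in this example, the short sketch below plugs the reported coefficients into the predicted-change formula (3.17); the coefficient values are taken from the estimated equation without avgsen.

    # Coefficients from the estimated equation without avgsen (see above).
    b_pcnv, b_ptime86, b_qemp86 = -0.150, -0.034, -0.104

    # Raising pcnv by .50, holding the other factors fixed:
    print(b_pcnv * 0.50)        # -0.075 arrests per man, i.e. 7.5 fewer per 100 men

    # Raising ptime86 from 0 to 12 months:
    print(b_ptime86 * 12)       # -0.408 predicted arrests

    # One more quarter of legal employment:
    print(b_qemp86 * 1)         # -0.104, about 10.4 fewer arrests per 100 men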
Equation (3.31) formally states the population model, sometimes called the true model,
to allow for the possibility that we might estimate a model that differs from (3.31). The
key feature is that the model is linear in the parameters b0, b1, …, bk. As we know, (3.31)
is quite flexible because y and the independent variables can be arbitrary functions of the
underlying variables of interest, such as natural logarithms and squares [see, for example,
equation (3.7)].
Sometimes, we need to write the equation for a particular observation i: for a randomly
drawn observation from the population, we have
yi = b0 + b1xi1 + b2xi2 + … + bkxik + ui.    [3.32]
Remember that i refers to the observation, and the second subscript on x is the variable
number. For example, we can write a CEO salary equation for a particular CEO i as
log(salaryi) = b0 + b1 log(salesi) + b2 ceoteni + b3 ceoteni² + ui.    [3.33]
The term ui contains the unobserved factors for CEO i that affect his or her salary. For applications, it is usually easiest to write the model in population form, as in (3.31). It contains less clutter and emphasizes the fact that we are interested in estimating a population relationship.
In light of model (3.31), the OLS estimators b̂0, b̂1, b̂2, …, b̂k from the regression of y on x1, …, xk are now considered to be estimators of b0, b1, …, bk. In Section 3.2,
we saw that OLS chooses the intercept and slope estimates for a particular sample so that
the residuals average to zero and the sample correlation between each independent vari-
able and the residuals is zero. Still, we did not include conditions under which the OLS
estimates are well defined for a given sample. The next assumption fills that gap.
Assumption MLR.3 is more complicated than its counterpart for simple regression because
we must now look at relationships between all independent variables. If an independent
variable in (3.31) is an exact linear combination of the other independent variables, then
we say the model suffers from perfect collinearity, and it cannot be estimated by OLS.
It is important to note that Assumption MLR.3 does allow the independent variables
to be correlated; they just cannot be perfectly correlated. If we did not allow for any corre-
lation among the independent variables, then multiple regression would be of very limited
use for econometric analysis. For example, in the model relating test scores to educational
expenditures and average family income,
we fully expect expend and avginc to be correlated: school districts with high average
family incomes tend to spend more per student on education. In fact, the primary motiva-
tion for including avginc in the equation is that we suspect it is correlated with expend,
and so we would like to hold it fixed in the analysis. Assumption MLR.3 only rules out
perfect correlation between expend and avginc in our sample. We would be very unlucky
to obtain a sample where per student expenditures are perfectly correlated with average
family income. But some correlation, perhaps a substantial amount, is expected and
certainly allowed.
The simplest way that two independent variables can be perfectly correlated is when
one variable is a constant multiple of another. This can happen when a researcher inad-
vertently puts the same variable measured in different units into a regression equation.
For example, in estimating a relationship between consumption and income, it makes no
sense to include as independent variables income measured in dollars as well as income
measured in thousands of dollars. One of these is redundant. What sense would it make to
hold income measured in dollars fixed while changing income measured in thousands of
dollars?
We already know that different nonlinear functions of the same variable can appear among the regressors. For example, the model cons = b0 + b1 inc + b2 inc² + u does not violate Assumption MLR.3: even though x2 = inc² is an exact function of x1 = inc, inc² is not an exact linear function of inc. Including inc² in the model is a useful way to generalize functional form, unlike including income measured in dollars and in thousands of dollars.
Common sense tells us not to include the same explanatory variable measured in dif-
ferent units in the same regression equation. There are also more subtle ways that one
independent variable can be a multiple of another. Suppose we would like to estimate an
extension of a constant elasticity consumption function. It might seem natural to specify a
model such as

log(cons) = b0 + b1 log(inc) + b2 log(inc²) + u,

where x1 = log(inc) and x2 = log(inc²). Using the basic properties of the natural log (see Appendix A), log(inc²) = 2 log(inc). That is, x2 = 2x1, and naturally this holds for all observations in the sample. This violates Assumption MLR.3. What we should do instead is include [log(inc)]², not log(inc²), along with log(inc). This is a sensible extension of the constant elasticity model, and we will see how to interpret such models in Chapter 6.
Another way that independent variables can be perfectly collinear is when one inde-
pendent variable can be expressed as an exact linear function of two or more of the other
independent variables. For example, suppose we want to estimate the effect of campaign
spending on campaign outcomes. For simplicity, assume that each election has two candi-
dates. Let voteA be the percentage of the vote for Candidate A, let expendA be campaign
expenditures by Candidate A, let expendB be campaign expenditures by Candidate B, and
let totexpend be total campaign expenditures; the latter three variables are all measured in
dollars. It may seem natural to specify the model as

voteA = b0 + b1 expendA + b2 expendB + b3 totexpend + u,    [3.35]
in order to isolate the effects of spending by each candidate and the total amount of spending. But this model violates Assumption MLR.3 because x3 = x1 + x2 by definition. Trying to interpret this equation in a ceteris paribus fashion reveals the problem. The parameter b1 in equation (3.35) is supposed to measure the effect of increasing expenditures by Candidate A by one dollar on Candidate A's vote, holding Candidate B's spending and total spending fixed. This is nonsense, because if expendB and totexpend are held fixed, then we cannot increase expendA.
The solution to the perfect collinearity in (3.35) is simple: drop any one of the three
variables from the model. We would probably drop totexpend, and then the coefficient on
expendA would measure the effect of increasing expenditures by A on the percentage of
the vote received by A, holding the spending by B fixed.
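The failure of Assumption MLR.3 in (3.35) can be seen directly in the data: when one regressor is an exact linear combination of the others, the matrix X'X is singular and the OLS system in (3.13) has no unique solution. A minimal sketch with hypothetical spending data (numpy only):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 50
    expendA = rng.uniform(10, 100, size=n)
    expendB = rng.uniform(10, 100, size=n)
    totexpend = expendA + expendB          # exact linear combination: x3 = x1 + x2

    # Including all three (plus an intercept) makes X'X singular.
    X_bad = np.column_stack([np.ones(n), expendA, expendB, totexpend])
    print(np.linalg.matrix_rank(X_bad.T @ X_bad))   # 3, not 4: perfect collinearity

    # Dropping totexpend restores full rank, and OLS is well defined.
    X_ok = np.column_stack([np.ones(n), expendA, expendB])
    print(np.linalg.matrix_rank(X_ok.T @ X_ok))     # 3 = k + 1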
The prior examples show that Assumption MLR.3 can fail if we are not careful in specifying our model. Assumption MLR.3 also fails if the sample size, n, is too small in relation to the number of parameters being estimated. In the general regression model in equation (3.31), there are k + 1 parameters, and MLR.3 fails if n < k + 1. Intuitively, this makes sense: to estimate k + 1 parameters, we need at least k + 1 observations. Not surprisingly, it is better to have as many observations as possible, something we will see with our variance calculations in Section 3.4.
If the model is carefully specified and n ≥ k + 1, Assumption MLR.3 can fail in rare cases due to bad luck in collecting the sample. For example, in a wage equation with education and experience as variables, it is possible that we could obtain a random sample where each individual has exactly twice as much education as years of experience. This scenario would cause Assumption MLR.3 to fail, but it can be considered very unlikely unless we have an extremely small sample size.

Exploring Further 3.3
In the previous example, if we use as explanatory variables expendA, expendB, and shareA, where shareA = 100(expendA/totexpend) is the percentage share of total campaign expenditures made by Candidate A, does this violate Assumption MLR.3?
The final, and most important, assumption needed for unbiasedness is a direct exten-
sion of Assumption SLR.4.
One way that Assumption MLR.4 can fail is if the functional relationship between the explained and explanatory variables is misspecified in equation (3.31): for example, if we forget to include the quadratic term inc² in the consumption function cons = b0 + b1 inc + b2 inc² + u when we estimate the model. Another functional form misspecification occurs
when we use the level of a variable when the log of the variable is what actually shows up
in the population model, or vice versa. For example, if the true model has log(wage) as the
dependent variable but we use wage as the dependent variable in our regression analysis,
then the estimators will be biased. Intuitively, this should be pretty clear. We will discuss
ways of detecting functional form misspecification in Chapter 9.
Omitting an important factor that is correlated with any of x1, x2, …, xk causes Assumption MLR.4 to fail also. With multiple regression analysis, we are able to include
many factors among the explanatory variables, and omitted variables are less likely to be
a problem in multiple regression analysis than in simple regression analysis. Nevertheless,
in any application, there are always factors that, due to data limitations or ignorance, we
will not be able to include. If we think these factors should be controlled for and they are
correlated with one or more of the independent variables, then Assumption MLR.4 will be
violated. We will derive this bias later.
There are other ways that u can be correlated with an explanatory variable. In Chapters 9
and 15, we will discuss the problem of measurement error in an explanatory variable. In
Chapter 16, we cover the conceptually more difficult problem in which one or more of the
explanatory variables is determined jointly with y—as occurs when we view quantities
and prices as being determined by the intersection of supply and demand curves. We must
postpone our study of these problems until we have a firm grasp of multiple regression
analysis under an ideal set of assumptions.
When Assumption MLR.4 holds, we often say that we have exogenous explanatory
variables. If xj is correlated with u for any reason, then xj is said to be an endogenous
explanatory variable. The terms “exogenous” and “endogenous” originated in simultane-
ous equations analysis (see Chapter 16), but the term “endogenous explanatory variable”
has evolved to cover any case in which an explanatory variable may be correlated with the
error term.
Before we show the unbiasedness of the OLS estimators under MLR.1 to MLR.4,
a word of caution. Beginning students of econometrics sometimes confuse Assumptions
MLR.3 and MLR.4, but they are quite different. Assumption MLR.3 rules out certain
relationships among the independent or explanatory variables and has nothing to do with
the error, u. You will know immediately when carrying out OLS estimation whether or
not Assumption MLR.3 holds. On the other hand, Assumption MLR.4—the much more
important of the two—restricts the relationship between the unobserved factors in u and
the explanatory variables. Unfortunately, we will never know for sure whether the average
value of the unobserved factors is unrelated to the explanatory variables. But this is the
critical assumption.
We are now ready to show unbiasedness of OLS under the first four multiple regres-
sion assumptions. As in the simple regression case, the expectations are conditional on
the values of the explanatory variables in the sample, something we show explicitly in
Appendix 3A but not in the text.
In our previous empirical examples, Assumption MLR.3 has been satisfied (because
we have been able to compute the OLS estimates). Furthermore, for the most part, the
samples are randomly chosen from a well-defined population. If we believe that the speci-
fied models are correct under the key Assumption MLR.4, then we can conclude that OLS
is unbiased in these examples.
Since we are approaching the point where we can use multiple regression in serious
empirical work, it is useful to remember the meaning of unbiasedness. It is tempting, in
examples such as the wage equation in (3.19), to say something like "9.2% is an unbiased estimate of the return to education."
and earlier in this chapter that this problem generally causes the OLS estimators to be
biased. It is time to show this explicitly and, just as importantly, to derive the direction
and size of the bias.
Deriving the bias caused by omitting an important variable is an example of
misspecification analysis. We begin with the case where the true population model has
two explanatory variables and an error term:
y = b0 + b1x1 + b2x2 + u,    [3.40]
and we assume that this model satisfies Assumptions MLR.1 through MLR.4.
Suppose that our primary interest is in b1, the partial effect of x1 on y. For example, y
is hourly wage (or log of hourly wage), x1 is education, and x2 is a measure of innate abil-
ity. In order to get an unbiased estimator of b1, we should run a regression of y on x1 and
x2 (which gives unbiased estimators of b0, b1, and b2). However, due to our ignorance or
data unavailability, we estimate the model by excluding x2. In other words, we perform a
simple regression of y on x1 only, obtaining the equation
ỹ = b̃0 + b̃1x1.    [3.41]

We use the symbol "~" rather than "^" to emphasize that b̃1 comes from an underspecified model.
When first learning about the omitted variable problem, it can be difficult to distin-
guish between the underlying true model, (3.40) in this case, and the model that we actu-
ally estimate, which is captured by the regression in (3.41). It may seem silly to omit the
variable x2 if it belongs in the model, but often we have no choice. For example, suppose
that wage is determined by
wage = b0 + b1 educ + b2 abil + u.    [3.42]

Since ability is not observed, we instead estimate the model

wage = b0 + b1 educ + v,

where v = b2 abil + u. The estimator of b1 from the simple regression of wage on educ is what we are calling b̃1.
We derive the expected value of b̃1 conditional on the sample values of x1 and x2. Deriving this expectation is not difficult because b̃1 is just the OLS slope estimator from a simple regression, and we have already studied this estimator extensively in Chapter 2. The difference here is that we must analyze its properties when the simple regression model is misspecified due to an omitted variable.
As it turns out, we have done almost all of the work to derive the bias in the simple regression estimator of b̃1. From equation (3.23) we have the algebraic relationship b̃1 = b̂1 + b̂2 δ̃1, where b̂1 and b̂2 are the slope estimators (if we could have them) from the multiple regression

yi on xi1, xi2,  i = 1, …, n    [3.43]

and δ̃1 is the slope from the simple regression

xi2 on xi1,  i = 1, …, n.    [3.44]

Because δ̃1 depends only on the independent variables in the sample, we treat it as fixed (nonrandom) when computing E(b̃1). Further, since the model in (3.40) satisfies
levels of education. Thus, the OLS estimates from the simple regression equation wage = b0 + b1 educ + v are on average too large. This does not mean that the estimate obtained from our sample is too big. We can only say that if we collect many random samples and obtain the simple regression estimates each time, then the average of these estimates will be greater than b1.
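A small simulation can make the direction of the bias concrete. The sketch below uses hypothetical parameter values and a made-up positive relationship between educ and abil; averaging the simple regression slope over many samples gives a number above the true b1, in line with the upward bias just described and with b̃1 = b̂1 + b̂2 δ̃1.

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps = 500, 2000
    beta0, beta1, beta2 = 1.0, 0.09, 0.05   # hypothetical return to educ, effect of abil
    slopes = []

    for _ in range(reps):
        abil = rng.normal(size=n)
        educ = 12 + 2 * abil + rng.normal(size=n)    # educ and abil positively correlated
        logwage = beta0 + beta1 * educ + beta2 * abil + rng.normal(scale=0.3, size=n)

        # Simple regression of log(wage) on educ only (abil is omitted).
        x = np.column_stack([np.ones(n), educ])
        slopes.append(np.linalg.solve(x.T @ x, x.T @ logwage)[1])

    # The average of the simple regression slopes exceeds beta1 = 0.09,
    # because part of the positive effect of the omitted abil is attributed to educ.
    print(np.mean(slopes))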
As a second example, suppose that, at the elementary school level, the average score
for students on a standardized exam is determined by
avgscore = b0 + b1 expend + b2 povrate + u,    [3.48]
where expend is expenditure per student and povrate is the poverty rate of the children
in the school. Using school district data, we only have observations on the percentage of
students with a passing grade and per student expenditures; we do not have information on
poverty rates. Thus, we estimate b1 from the simple regression of avgscore on expend.
We can again obtain the likely bias in b̃1. First, b2 is probably negative: There is ample evidence that children living in poverty score lower, on average, on standardized tests. Second, the average expenditure per student is probably negatively correlated with the poverty rate: The higher the poverty rate, the lower the average per-student spending, so that Corr(x1, x2) < 0. From Table 3.2, b̃1 will have a positive bias. This observation has important implications. It could be that the true effect of spending is zero; that is, b1 = 0. However, the simple regression estimate of b1 will usually be greater than zero, and this could lead us to conclude that expenditures are important when they are not.
When reading and performing empirical work in economics, it is important to master the terminology associated with biased estimators. In the context of omitting a variable from model (3.40), if E(b̃1) > b1, then we say that b̃1 has an upward bias. When E(b̃1) < b1, b̃1 has a downward bias. These definitions are the same whether b1 is positive or negative. The phrase biased toward zero refers to cases where E(b̃1) is closer to zero than is b1. Therefore, if b1 is positive, then b̃1 is biased toward zero if it has a downward bias. On the other hand, if b1 < 0, then b̃1 is biased toward zero if it has an upward bias.
explanatory variable and the error generally results in all OLS estimators being biased.
For example, suppose the population model

y = b0 + b1x1 + b2x2 + b3x3 + u    [3.49]

satisfies Assumptions MLR.1 through MLR.4. But we omit x3 and estimate the model as

ỹ = b̃0 + b̃1x1 + b̃2x2.    [3.50]
Now, suppose that x2 and x3 are uncorrelated, but that x1 is correlated with x3. In other words, x1 is correlated with the omitted variable, but x2 is not. It is tempting to think that, while b̃1 is probably biased based on the derivation in the previous subsection, b̃2 is unbiased because x2 is uncorrelated with x3. Unfortunately, this is not generally the case: both b̃1 and b̃2 will normally be biased. The only exception to this is when x1 and x2 are also uncorrelated.
Even in the fairly simple model above, it can be difficult to obtain the direction of bias in b̃1 and b̃2. This is because x1, x2, and x3 can all be pairwise correlated. Nevertheless, an approximation is often practically useful. If we assume that x1 and x2 are uncorrelated, then we can study the bias in b̃1 as if x2 were absent from both the population and the estimated models. In fact, when x1 and x2 are uncorrelated, it can be shown that
E(b̃1) = b1 + b3 · [ ∑_{i=1}^{n} (xi1 − x̄1)xi3 ] / [ ∑_{i=1}^{n} (xi1 − x̄1)² ].

This is just like equation (3.45), but b3 replaces b2, and x3 replaces x2 in regression (3.44). Therefore, the bias in b̃1 is obtained by replacing b2 with b3 and x2 with x3 in Table 3.2. If b3 > 0 and Corr(x1, x3) > 0, the bias in b̃1 is positive, and so on.
As an example, suppose we add exper to the wage model:

wage = b0 + b1 educ + b2 exper + b3 abil + u.

If abil is omitted from the model, the estimators of both b1 and b2 are biased, even if we assume exper is uncorrelated with abil. We are mostly interested in the return to education, so it would be nice if we could conclude that b̃1 has an upward or a downward bias due to omitted ability. This conclusion is not possible without further assumptions. As an approximation, let us suppose that, in addition to exper and abil being uncorrelated, educ and exper are also uncorrelated. (In reality, they are somewhat negatively correlated.) Since b3 > 0 and educ and abil are positively correlated, b̃1 would have an upward bias, just as if exper were not in the model.
The reasoning used in the previous example is often followed as a rough guide for
obtaining the likely bias in estimators in more complicated models. Usually, the focus is
on the relationship between a particular explanatory variable, say, x1, and the key omit-
ted factor. Strictly speaking, ignoring all other explanatory variables is a valid practice
only when each one is uncorrelated with x1, but it is still a useful guide. Appendix 3A
contains a more careful analysis of omitted variable bias with multiple explanatory
variables.
Assumption MLR.5 means that the variance in the error term, u, conditional on the
explanatory variables, is the same for all combinations of outcomes of the explanatory
variables. If this assumption fails, then the model exhibits heteroskedasticity, just as in the
two-variable case.
In the equation
wage = b0 + b1 educ + b2 exper + b3 tenure + u,
homoskedasticity requires that the variance of the unobserved error u does not depend on
the levels of education, experience, or tenure. That is,
Var(u | educ, exper, tenure) = σ².
If this variance changes with any of the three explanatory variables, then heteroskedasticity
is present.
Assumptions MLR.1 through MLR.5 are collectively known as the Gauss-Markov
assumptions (for cross-sectional regression). So far, our statements of the assumptions
are suitable only when applied to cross-sectional analysis with random sampling. As we
will see, the Gauss-Markov assumptions for time series analysis, and for other situa-
tions such as panel data analysis, are more difficult to state, although there are many
similarities.
In the discussion that follows, we will use the symbol x to denote the set of all in-
dependent variables, (x1, …, xk). Thus, in the wage regression with educ, exper, and ten-
ure as independent variables, x = (educ, exper, tenure). Then we can write Assumptions MLR.1 and MLR.4 as

E(y | x) = b0 + b1x1 + b2x2 + … + bkxk,

and Assumption MLR.5 is the same as Var(y | x) = σ². Stating the assumptions in this
way clearly illustrates how Assumption MLR.5 differs greatly from Assumption MLR.4.
Assumption MLR.4 says that the expected value of y, given x, is linear in the parameters,
but it certainly depends on x1, x2, …, xk. Assumption MLR.5 says that the variance of y,
given x, does not depend on the values of the independent variables.
We can now obtain the variances of the b̂j, where we again condition on the sample values of the independent variables. The proof is in the appendix to this chapter.
The careful reader may be wondering whether there is a simple formula for the variance of b̂j where we do not condition on the sample outcomes of the explanatory variables. The answer is: None that is useful. The formula in (3.51) is a highly nonlinear function of the xij, making averaging out across the population distribution of the explanatory variables virtually impossible. Fortunately, for any practical purpose equation (3.51) is what we want. Even when we turn to approximate, large-sample properties of OLS in Chapter 5, it turns out that (3.51) estimates the quantity we need for large-sample analysis, provided Assumptions MLR.1 through MLR.5 hold.
Before we study equation (3.51) in more detail, it is important to know that all of the
Gauss-Markov assumptions are used in obtaining this formula. Whereas we did not need
the homoskedasticity assumption to conclude that OLS is unbiased, we do need it to vali-
date equation (3.51).
The size of Var(b̂j) is practically important. A larger variance means a less precise estimator, and this translates into larger confidence intervals and less accurate hypothesis tests (as we will see in Chapter 4). In the next subsection, we discuss the elements comprising (3.51).
The Error Variance, σ². From equation (3.51), a larger σ² means larger variances for the OLS estimators. This is not at all surprising: more "noise" in the equation (a larger σ²) makes it more difficult to estimate the partial effect of any of the independent variables on y, and this is reflected in higher variances for the OLS slope estimators. Because σ² is a feature of the population, it has nothing to do with the sample size. It is the one component of (3.51) that is unknown. We will see later how to obtain an unbiased estimator of σ².
For a given dependent variable y, there is really only one way to reduce the error
variance, and that is to add more explanatory variables to the equation (take some factors
out of the error term). Unfortunately, it is not always possible to find additional legitimate
factors that affect y.
The Total Sample Variation in xj, SSTj. From equation (3.51), we see that the larger the total variation in xj is, the smaller is Var(b̂j). Thus, everything else being equal, for estimating bj we prefer to have as much sample variation in xj as possible.
The Linear Relationships among the Independent Variables, Rj². The term Rj² in equation (3.51) is the most difficult of the three components to understand. This term does not appear in simple regression analysis because there is only one independent variable in such cases. It is important to see that this R-squared is distinct from the R-squared in the regression of y on x1, x2, …, xk: Rj² is obtained from a regression involving only the independent variables in the original model, where xj plays the role of a dependent variable.
Consider first the k = 2 case: y = b0 + b1x1 + b2x2 + u. Then, Var(b̂1) = σ²/[SST1(1 − R1²)], where R1² is the R-squared from the simple regression of x1 on x2 (and an intercept, as always). Because the R-squared measures goodness-of-fit, a value of R1² close to one indicates that x2 explains much of the variation in x1 in the sample. This means that x1 and x2 are highly correlated.
As R1² increases to one, Var(b̂1) gets larger and larger. Thus, a high degree of linear relationship between x1 and x2 can lead to large variances for the OLS slope estimators. (A similar argument applies to b̂2.) See Figure 3.1 for the relationship between Var(b̂1) and the R-squared from the regression of x1 on x2.
In the general case, Rj² is the proportion of the total variation in xj that can be explained by the other independent variables appearing in the equation. For a given σ² and SSTj, the smallest Var(b̂j) is obtained when Rj² = 0, which happens if, and only if, xj has zero sample correlation with every other independent variable. This is the best case for estimating bj, but it is rarely encountered.
The other extreme case, Rj² = 1, is ruled out by Assumption MLR.3, because Rj² = 1 means that, in the sample, xj is a perfect linear combination of some of the other independent variables in the regression. A more relevant case is when Rj² is "close" to one. From equation (3.51) and Figure 3.1, we see that this can cause Var(b̂j) to be large: Var(b̂j) → ∞ as Rj² → 1. High (but not perfect) correlation between two or more independent variables is called multicollinearity.
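The k = 2 variance expression given above, Var(b̂1) = σ²/[SST1(1 − R1²)], can be explored numerically. The sketch below uses hypothetical data and takes a hypothetical error variance σ² = 1 as given; it shows how the inflation factor 1/(1 − R1²) and the variance grow as x1 and x2 become more highly correlated.

    import numpy as np

    def r_squared_of_x1_on_x2(x1, x2):
        """R-squared from the simple regression of x1 on x2 (with intercept)."""
        Z = np.column_stack([np.ones(len(x2)), x2])
        fitted = Z @ np.linalg.solve(Z.T @ Z, Z.T @ x1)
        return 1.0 - np.sum((x1 - fitted) ** 2) / np.sum((x1 - x1.mean()) ** 2)

    rng = np.random.default_rng(6)
    n, sigma2 = 1000, 1.0              # sigma2 is a hypothetical error variance

    for rho in (0.0, 0.5, 0.9, 0.99):
        x2 = rng.normal(size=n)
        x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
        r2_1 = r_squared_of_x1_on_x2(x1, x2)
        sst1 = np.sum((x1 - x1.mean()) ** 2)
        var_b1 = sigma2 / (sst1 * (1.0 - r2_1))    # the k = 2 variance formula
        print(rho, round(1.0 / (1.0 - r2_1), 2), var_b1)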
Before we discuss the multicollinearity issue further, it is important to be very clear on one thing: A case where Rj² is close to one is not a violation of Assumption MLR.3.
Since multicollinearity violates none of our assumptions, the "problem" of multicollinearity is not really well defined. When we say that multicollinearity arises for estimating bj when Rj² is "close" to one, we put "close" in quotation marks because there is no absolute number that we can cite to conclude that multicollinearity is a problem. For example, Rj² = .9 means that 90% of the sample variation in xj can be explained by the other independent variables in the regression model. Unquestionably, this means that xj has a strong linear relationship to the other independent variables. But whether this translates into a Var(b̂j) that is too large to be useful depends on the sizes of σ² and SSTj. As we
will see in Chapter 4, for statistical inference, what ultimately matters is how big b̂j is in relation to its standard deviation.
Just as a large value of Rj² can cause a large Var(b̂j), so can a small value of SSTj. Therefore, a small sample size can lead to large sampling variances, too. Worrying about high degrees of correlation among the independent variables in the sample is really no different from worrying about a small sample size: both work to increase Var(b̂j). The famous University of Wisconsin econometrician Arthur Goldberger, reacting to econometricians' obsession with multicollinearity, has (tongue in cheek) coined the term micronumerosity, which he defines as the "problem of small sample size." [For an engaging discussion of multicollinearity and micronumerosity, see Goldberger (1991).]
Although the problem of multicollinearity cannot be clearly defined, one thing is clear:
everything else being equal, for estimating bj, it is better to have less correlation between
xj and the other independent variables. This observation often leads to a discussion of how
to “solve” the multicollinearity problem. In the social sciences, where we are usually pas-
sive collectors of data, there is no good way to reduce variances of unbiased estimators
other than to collect more data. For a given data set, we can try dropping other independent
variables from the model in an effort to reduce multicollinearity. Unfortunately, dropping
a variable that belongs in the population model can lead to bias, as we saw in Section 3.3.
Perhaps an example at this point will help clarify some of the issues raised concern-
ing multicollinearity. Suppose we are interested in estimating the effect of various school
expenditure categories on student performance. It is likely that expenditures on teacher
where x_2 and x_3 are highly correlated. Then Var(\hat{b}_2) and Var(\hat{b}_3) may be large. But the
amount of correlation between x_2 and x_3 has no direct effect on Var(\hat{b}_1). In fact, if x_1 is
uncorrelated with x_2 and x_3, then R_1^2 = 0 and Var(\hat{b}_1) = \sigma^2 / SST_1, regardless of
how much correlation there is between x_2 and x_3. If b_1 is the parameter of interest,
we do not really care about the amount of correlation between x_2 and x_3.

The previous observation is important because economists often include many
control variables in order to isolate the causal effect of a particular variable. For
example, in looking at the relationship between loan approval rates and the percentage
of minorities in a neighborhood, we might include variables like average income,
average housing value, and measures of creditworthiness, because these factors need
to be accounted for in order to draw causal conclusions about discrimination. Income,
housing prices, and creditworthiness are generally highly correlated with each other.
But high correlations among these controls do not make it more difficult to determine
the effects of discrimination.

Exploring Further 3.4
Suppose you postulate a model explaining final exam score in terms of class attendance.
Thus, the dependent variable is final exam score, and the key explanatory variable is
number of classes attended. To control for student abilities and efforts outside the
classroom, you include among the explanatory variables cumulative GPA, SAT score,
and measures of high school performance. Someone says, "You cannot hope to learn
anything from this exercise because cumulative GPA, SAT score, and high school
performance are likely to be highly collinear." What should be your response?
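This point is easy to check with a short simulation. The following Python sketch is purely illustrative and is not one of the text's empirical examples: the population model, coefficient values, and sample size are invented, and the estimates are computed with a generic least squares routine. It generates x_1 uncorrelated with two controls x_2 and x_3 and compares the sampling variability of \hat{b}_1 when the controls are nearly uncorrelated versus highly correlated.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 2000

for rho in (0.0, 0.95):              # correlation between the two controls x2 and x3
    b1_draws = []
    for _ in range(reps):
        x1 = rng.normal(size=n)      # x1 is uncorrelated with the controls
        x2 = rng.normal(size=n)
        x3 = rho * x2 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1 + 0.5 * x1 + 1.0 * x2 + 1.0 * x3 + rng.normal(size=n)  # hypothetical model
        X = np.column_stack([np.ones(n), x1, x2, x3])
        b1_draws.append(np.linalg.lstsq(X, y, rcond=None)[0][1])     # coefficient on x1
    print(f"corr(x2, x3) target = {rho:.2f}: sd of b1hat across samples = {np.std(b1_draws):.4f}")
# The two reported standard deviations are essentially identical: the x2-x3
# correlation inflates Var(b2hat) and Var(b3hat), but not Var(b1hat).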
Some researchers find it useful to compute statistics intended to determine the severity
of multicollinearity in a given application. Unfortunately, it is easy to misuse such statistics
because, as we have discussed, we cannot specify how much correlation among explanatory
variables is “too much.” Some multicollinearity “diagnostics” are omnibus statistics in the
sense that they detect a strong linear relationship among any subset of explanatory variables.
For reasons that we just saw, such statistics are of questionable value because they might
reveal a “problem” simply because two control variables, whose coefficients we do not care
about, are highly correlated. [Probably the most common omnibus multicollinearity statistic
is the so-called condition number, which is defined in terms of the full data matrix and is
beyond the scope of this text. See, for example, Belsley, Kuh, and Welsh (1980).]
Somewhat more useful, but still prone to misuse, are statistics for individual coefficients. The most common of these is the variance inflation factor (VIF), which is
obtained directly from equation (3.51). The VIF for slope coefficient j is simply VIF_j =
1/(1 - R_j^2), precisely the term in Var(\hat{b}_j) that is determined by correlation between x_j and
the other explanatory variables. We can write Var(\hat{b}_j) in equation (3.51) as

Var(\hat{b}_j) = \frac{\sigma^2}{SST_j} \cdot VIF_j,

which shows that VIF_j is the factor by which Var(\hat{b}_j) is higher because x_j is not uncorrelated with the other explanatory variables. Because VIF_j is a function of R_j^2—indeed,
Figure 3.1 is essentially a graph of VIF_1—our previous discussion can be cast entirely
in terms of the VIF. For example, if we had the choice, we would like VIF_j to be smaller
(other things equal). But we rarely have the choice. If we think certain explanatory variables need to be included in a regression to infer the causal effect of x_j, then we are hesitant to
drop them, and whether we think VIF_j is "too high" cannot really affect that decision.
If, say, our main interest is in the causal effect of x_1 on y, then we should ignore entirely
the VIFs of other coefficients. Finally, setting a cutoff value for VIF above which we
conclude multicollinearity is a "problem" is arbitrary and not especially helpful. Sometimes the value 10 is chosen: if VIF_j is above 10 (equivalently, R_j^2 is above .9), then
we conclude that multicollinearity is a "problem" for estimating b_j. But a VIF_j above
10 does not mean that the standard deviation of \hat{b}_j is too large to be useful, because
the standard deviation also depends on \sigma and SST_j, and the latter can be increased by
increasing the sample size. Therefore, just as with looking at the size of R_j^2 directly,
looking at the size of VIF_j is of limited use, although one might want to do so out of
curiosity.
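Readers who want to inspect VIFs in their own applications can compute them in a few lines. The sketch below is a minimal illustration rather than a prescribed routine: the function name vifs, the simulated regressors, and the use of a plain least squares solver are all assumptions made for the example.

import numpy as np

def vifs(X):
    """Return VIF_j = 1/(1 - R_j^2) for each column of X (regressors only, no constant)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        # R_j^2 from regressing x_j on the other explanatory variables
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2j = 1.0 - resid.var() / xj.var()
        out[j] = 1.0 / (1.0 - r2j)
    return out

# Example with two highly correlated regressors and one unrelated regressor:
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=200)
x3 = rng.normal(size=200)
print(vifs(np.column_stack([x1, x2, x3])))   # large VIFs for x1 and x2; VIF near 1 for x3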
The tradeoff between bias and variance arises when we must decide whether to include a particular variable in a model. Suppose the true population model, which satisfies the Gauss-Markov assumptions, is

y = b_0 + b_1 x_1 + b_2 x_2 + u.

We consider two estimators of b_1. The estimator \hat{b}_1 comes from the multiple regression

\hat{y} = \hat{b}_0 + \hat{b}_1 x_1 + \hat{b}_2 x_2. [3.52]
In other words, we include x_2, along with x_1, in the regression model. The estimator \tilde{b}_1 is
obtained by omitting x_2 from the model and running a simple regression of y on x_1:

\tilde{y} = \tilde{b}_0 + \tilde{b}_1 x_1. [3.53]

When b_2 ≠ 0, equation (3.53) excludes a relevant variable from the model and, as
we saw in Section 3.3, this induces a bias in \tilde{b}_1 unless x_1 and x_2 are uncorrelated. On the
other hand, \hat{b}_1 is unbiased for b_1 for any value of b_2, including b_2 = 0. It follows that, if bias
is used as the only criterion, \hat{b}_1 is preferred to \tilde{b}_1.
The conclusion that \hat{b}_1 is always preferred to \tilde{b}_1 does not carry over when we bring
variance into the picture. Conditioning on the values of x_1 and x_2 in the sample, we have,
from (3.51),

Var(\hat{b}_1) = \sigma^2 / [SST_1(1 - R_1^2)], [3.54]

where SST_1 is the total variation in x_1, and R_1^2 is the R-squared from the regression of x_1
on x_2. Further, a simple modification of the proof in Chapter 2 for two-variable regression
shows that

Var(\tilde{b}_1) = \sigma^2 / SST_1. [3.55]

Comparing (3.55) to (3.54) shows that Var(\tilde{b}_1) is always smaller than Var(\hat{b}_1), unless
x_1 and x_2 are uncorrelated in the sample, in which case the two estimators \tilde{b}_1 and \hat{b}_1 are
the same. Assuming that x_1 and x_2 are not uncorrelated, we can draw the following
conclusions:

1. When b_2 ≠ 0, \tilde{b}_1 is biased, \hat{b}_1 is unbiased, and Var(\tilde{b}_1) < Var(\hat{b}_1).
2. When b_2 = 0, \tilde{b}_1 and \hat{b}_1 are both unbiased, and Var(\tilde{b}_1) < Var(\hat{b}_1).
From the second conclusion, it is clear that \tilde{b}_1 is preferred if b_2 = 0. Intuitively, if x_2 does
not have a partial effect on y, then including it in the model can only exacerbate the multicollinearity problem, which leads to a less efficient estimator of b_1. A higher variance for
the estimator of b_1 is the cost of including an irrelevant variable in a model.

The case where b_2 ≠ 0 is more difficult. Leaving x_2 out of the model results in a
biased estimator of b_1. Traditionally, econometricians have suggested comparing the likely
size of the bias due to omitting x_2 with the reduction in the variance—summarized in the
size of R_1^2—to decide whether x_2 should be included. However, when b_2 ≠ 0, there are two
favorable reasons for including x_2 in the model. The most important of these is that any
bias in \tilde{b}_1 does not shrink as the sample size grows; in fact, the bias does not necessarily
follow any pattern. Therefore, we can usefully think of the bias as being roughly the same
for any sample size. On the other hand, Var(\tilde{b}_1) and Var(\hat{b}_1) both shrink to zero as n gets
large, which means that the multicollinearity induced by adding x_2 becomes less important
as the sample size grows. In large samples, we would prefer \hat{b}_1.

The other reason for favoring \hat{b}_1 is more subtle. The variance formula in (3.55) is conditional on the values of x_{i1} and x_{i2} in the sample, which provides the best scenario for \tilde{b}_1.
When b_2 ≠ 0, the variance of \tilde{b}_1 conditional only on x_1 is larger than that presented in
(3.55). Intuitively, when b_2 ≠ 0 and x_2 is excluded from the model, the error variance increases because the error effectively contains part of x_2. But (3.55) ignores the error variance increase because it treats both regressors as nonrandom. A full discussion of which
independent variables to condition on would lead us too far astray. It is sufficient to say
that (3.55) is too generous when it comes to measuring the precision in \tilde{b}_1.
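A small Monte Carlo experiment can make this bias-variance tradeoff concrete. The sketch below uses an invented data-generating process in which x_1 and x_2 are correlated; the sample size, coefficient values, and number of replications are arbitrary choices for illustration only.

import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 5000
b1, b2 = 1.0, 0.5            # set b2 = 0.0 to reproduce case 2 of the conclusions above
tilde, hat = [], []

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)      # x1 and x2 are correlated
    y = 2.0 + b1 * x1 + b2 * x2 + rng.normal(size=n)
    # multiple regression: include x2 -> b1hat
    X = np.column_stack([np.ones(n), x1, x2])
    hat.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    # simple regression: omit x2 -> b1tilde
    Xs = np.column_stack([np.ones(n), x1])
    tilde.append(np.linalg.lstsq(Xs, y, rcond=None)[0][1])

for name, est in [("b1tilde (omit x2)", tilde), ("b1hat (include x2)", hat)]:
    est = np.asarray(est)
    print(f"{name}: mean = {est.mean():.3f}, sd = {est.std():.3f}")
# With b2 != 0, b1tilde is biased (its mean is away from 1.0) but has the smaller sd;
# b1hat is centered on 1.0 with the larger sd, exactly as in conclusions 1 and 2.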
The errors in the population model are u_i = y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - … - b_k x_{ik}, and so the reason we do not observe the u_i is that we do not know the
b_j. When we replace each b_j with its OLS estimator, we get the OLS residuals:

\hat{u}_i = y_i - \hat{b}_0 - \hat{b}_1 x_{i1} - \hat{b}_2 x_{i2} - … - \hat{b}_k x_{ik}.

It seems natural to estimate \sigma^2 by replacing the u_i with the \hat{u}_i. In the simple regression case,
we saw that this leads to a biased estimator. The unbiased estimator of \sigma^2 in the general
multiple regression case is

\hat{\sigma}^2 = \left( \sum_{i=1}^{n} \hat{u}_i^2 \right) \Big/ (n - k - 1) = SSR/(n - k - 1). [3.56]
The positive square root of \hat{\sigma}^2, denoted \hat{\sigma}, is called the standard error of the regression
(SER). The SER is an estimator of the standard deviation of the error term. This estimate
is usually reported by regression packages, although it is called different things by different packages. (In addition to SER, \hat{\sigma} is also called the standard error of the estimate and
the root mean squared error.)

Note that \hat{\sigma} can either decrease or increase when another independent variable is
added to a regression (for a given sample). This is because, although SSR must fall when
another explanatory variable is added, the degrees of freedom also falls by one. Because
SSR is in the numerator and df is in the denominator, we cannot tell beforehand which
effect will dominate.
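Computing \hat{\sigma} as in equation (3.56) takes only a few lines once the OLS residuals are in hand. The sketch below uses simulated data and a generic least squares routine; it illustrates the formula and is not a description of any particular regression package.

import numpy as np

rng = np.random.default_rng(3)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.3 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # intercept plus k = 2 regressors
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
uhat = y - X @ bhat                          # OLS residuals
k = X.shape[1] - 1
ssr = np.sum(uhat**2)
sigma2_hat = ssr / (n - k - 1)               # unbiased estimator in (3.56)
ser = np.sqrt(sigma2_hat)                    # standard error of the regression
print(f"SSR = {ssr:.2f}, sigma2_hat = {sigma2_hat:.4f}, SER = {ser:.4f}")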
For constructing confidence intervals and conducting tests in Chapter 4, we will need
to estimate the standard deviation of \hat{b}_j, which is just the square root of the variance:

sd(\hat{b}_j) = \sigma / [SST_j(1 - R_j^2)]^{1/2}.

Since \sigma is unknown, we replace it with its estimator, \hat{\sigma}. This gives us the standard error
of \hat{b}_j:

se(\hat{b}_j) = \hat{\sigma} / [SST_j(1 - R_j^2)]^{1/2}. [3.58]

Just as the OLS estimates can be obtained for any given sample, so can the standard errors.
Since se(\hat{b}_j) depends on \hat{\sigma}, the standard error has a sampling distribution, which will play
a role in Chapter 4.
We should emphasize one thing about standard errors. Because (3.58) is obtained
directly from the variance formula in (3.51), and because (3.51) relies on the homoskedasticity Assumption MLR.5, it follows that the standard error formula in (3.58) is not a
valid estimator of sd(\hat{b}_j) if the errors exhibit heteroskedasticity. Thus, while the presence
of heteroskedasticity does not cause bias in the \hat{b}_j, it does lead to bias in the usual formula
for Var(\hat{b}_j), which then invalidates the standard errors. This is important because any regression package computes (3.58) as the default standard error for each coefficient (with a
somewhat different representation for the intercept). If we suspect heteroskedasticity, then
the "usual" OLS standard errors are invalid, and some corrective action should be taken.
We will see in Chapter 8 what methods are available for dealing with heteroskedasticity.
For some purposes it is helpful to write

se(\hat{b}_j) = \hat{\sigma} / [\sqrt{n}\, sd(x_j) \sqrt{1 - R_j^2}], [3.59]

in which we take sd(x_j) = \sqrt{n^{-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} to be the sample standard deviation, where
the total sum of squares is divided by n rather than n - 1. The importance of equation
(3.59) is that it shows how the sample size, n, directly affects the standard errors. The other
three terms in the formula—\hat{\sigma}, sd(x_j), and R_j^2—will change with different samples, but as
n gets large they settle down to constants. Therefore, we can see from equation (3.59) that
the standard errors shrink to zero at the rate 1/\sqrt{n}. This formula demonstrates the value of
getting more data: the precision of the \hat{b}_j increases as n increases. (By contrast, recall that
unbiasedness holds for any sample size subject to being able to compute the estimators.)
We will talk more about large sample properties of OLS in Chapter 5.
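Equation (3.58) can also be verified numerically. The sketch below computes the standard errors directly from \hat{\sigma}, SST_j, and R_j^2, and then repeats the calculation on samples four times as large to illustrate the 1/\sqrt{n} rate; the data-generating process and the function name ols_se are invented for the illustration.

import numpy as np

def ols_se(X, y):
    """Standard errors from (3.58); X includes a constant in its first column."""
    n, kp1 = X.shape
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    uhat = y - X @ bhat
    sigma_hat = np.sqrt(uhat @ uhat / (n - kp1))       # n - k - 1 degrees of freedom
    se = np.empty(kp1 - 1)
    for j in range(1, kp1):                            # slope coefficients only
        xj = X[:, j]
        sst_j = np.sum((xj - xj.mean())**2)
        others = np.delete(X, j, axis=1)
        gamma, *_ = np.linalg.lstsq(others, xj, rcond=None)
        r2_j = 1.0 - np.sum((xj - others @ gamma)**2) / sst_j
        se[j - 1] = sigma_hat / np.sqrt(sst_j * (1.0 - r2_j))
    return se

rng = np.random.default_rng(4)
for n in (100, 400, 1600):                             # each step multiplies n by 4
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)
    y = 1 + 0.7 * x1 - 0.2 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    print(n, np.round(ols_se(X, y), 4))                # the se's roughly halve each time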
Under Assumptions MLR.1 through MLR.5, the OLS estimator \hat{b}_j for b_j is the best linear unbiased estimator
(BLUE). To state the theorem, we need to understand each component of the acronym
"BLUE." First, we know what an estimator is: it is a rule that can be applied to any sample
of data to produce an estimate. We also know what an unbiased estimator is: in the current
context, an estimator, say, \tilde{b}_j, of b_j is an unbiased estimator of b_j if E(\tilde{b}_j) = b_j for any b_0,
b_1, …, b_k.
What about the meaning of the term "linear"? In the current context, an estimator \tilde{b}_j
of b_j is linear if, and only if, it can be expressed as a linear function of the data on the dependent variable:

\tilde{b}_j = \sum_{i=1}^{n} w_{ij} y_i, [3.60]

where each w_{ij} can be a function of the sample values of all the independent variables. The
OLS estimators are linear, as can be seen from equation (3.22).
Finally, how do we define “best”? For the current theorem, best is defined as having
the smallest variance. Given two unbiased estimators, it is logical to prefer the one with
the smallest variance (see Appendix C).
Now, let \hat{b}_0, \hat{b}_1, …, \hat{b}_k denote the OLS estimators in model (3.31) under Assumptions
MLR.1 through MLR.5. The Gauss-Markov Theorem says that, for any estimator \tilde{b}_j that is
linear and unbiased, Var(\hat{b}_j) ≤ Var(\tilde{b}_j), and the inequality is usually strict. In other words,
in the class of linear unbiased estimators, OLS has the smallest variance (under the five
Gauss-Markov assumptions). Actually, the theorem says more than this. If we want to
estimate any linear function of the b_j, then the corresponding linear combination of the
OLS estimators achieves the smallest variance among all linear unbiased estimators. We
conclude with a theorem, which is proven in Appendix 3A.
It is because of this theorem that Assumptions MLR.1 through MLR.5 are known as the
Gauss-Markov assumptions (for cross-sectional analysis).
The importance of the Gauss-Markov Theorem is that, when the standard set of
assumptions holds, we need not look for alternative unbiased estimators of the form in
(3.60): none will be better than OLS. Equivalently, if we are presented with an estimator
that is both linear and unbiased, then we know that the variance of this estimator is at least
as large as the OLS variance; no additional calculation is needed to show this.
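The force of the theorem can be seen by pitting OLS against some other linear unbiased estimator. One well-known alternative in the simple regression case is a grouping estimator: split the sample at the median of x and divide the difference in the group means of y by the difference in the group means of x. (This estimator is linear in the y_i and, conditional on the x_i, unbiased.) The simulation below is only a sketch with invented numbers, not a proof, but it shows OLS delivering the smaller sampling variance, as the theorem guarantees.

import numpy as np

rng = np.random.default_rng(5)
n, reps, b0, b1 = 80, 5000, 1.0, 0.5
ols_est, grp_est = [], []

for _ in range(reps):
    x = rng.normal(size=n)
    y = b0 + b1 * x + rng.normal(size=n)
    # OLS slope
    ols_est.append(np.cov(x, y, bias=True)[0, 1] / np.var(x))
    # grouping estimator: linear and unbiased, but not OLS
    above = x > np.median(x)
    grp_est.append((y[above].mean() - y[~above].mean()) /
                   (x[above].mean() - x[~above].mean()))

print("OLS     : mean %.3f, sd %.4f" % (np.mean(ols_est), np.std(ols_est)))
print("grouping: mean %.3f, sd %.4f" % (np.mean(grp_est), np.std(grp_est)))
# Both estimators are centered at 0.5, but OLS has the smaller sampling
# standard deviation, as the Gauss-Markov Theorem predicts.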
For our purposes, Theorem 3.4 justifies the use of OLS to estimate multiple regres-
sion models. If any of the Gauss-Markov assumptions fail, then this theorem no longer
holds. We already know that failure of the zero conditional mean assumption (Assumption
MLR.4) causes OLS to be biased, so Theorem 3.4 also fails. We also know that heteroske-
dasticity (failure of Assumption MLR.5) does not cause OLS to be biased. However, OLS
no longer has the smallest variance among linear unbiased estimators in the presence of
heteroskedasticity. In Chapter 8, we analyze an estimator that improves upon OLS when
we know the brand of heteroskedasticity.
one describes the data source (which ideally is obtained via random sampling) as well as
the OLS estimates obtained from the sample. A proper way to introduce a discussion of
the estimates is to say “I estimated equation (3.62) by ordinary least squares. Under the
assumption that no important variables have been omitted from the equation, and assum-
ing random sampling, the OLS estimator of the class size effect, b1, is unbiased. If the
error term u has constant variance, the OLS estimator is actually best linear unbiased.” As
we will see in Chapters 4 and 5, we can often say even more about OLS. Of course, one
might want to admit that while controlling for third-grade math score, family income, and
parents’ education might account for important differences across students, it might not be
enough—for example, u can include motivation of the student or parents—in which case
OLS might be biased.
A more subtle reason for being careful in distinguishing between an underlying
population model and an estimation method used to estimate a model is that estimation
methods such as OLS can be used as essentially an exercise in curve fitting or prediction,
without explicitly worrying about an underlying model and the usual statistical properties
of unbiasedness and efficiency. For example, we might just want to use OLS to estimate
a line that allows us to predict future college GPA for a set of high school students with
given characteristics.
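As an illustration of this purely predictive use of OLS, the sketch below fits a line to invented data on SAT scores, high school GPA, and college GPA, and then predicts college GPA for two new students. All numbers are hypothetical, and no causal interpretation is intended.

import numpy as np

rng = np.random.default_rng(6)
n = 300
sat = rng.normal(1050, 140, size=n)
hsgpa = np.clip(rng.normal(3.2, 0.4, size=n), 0, 4)
colgpa = 0.6 + 0.0012 * sat + 0.45 * hsgpa + rng.normal(0, 0.3, size=n)

X = np.column_stack([np.ones(n), sat, hsgpa])
bhat, *_ = np.linalg.lstsq(X, colgpa, rcond=None)

# Predict college GPA for two new applicants (a fitting exercise only:
# no claim is made that SAT or high school GPA causes college GPA).
new = np.array([[1.0, 1200, 3.8],
                [1.0,  950, 2.9]])
print(new @ bhat)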
Summary
1. The multiple regression model allows us to effectively hold other factors fixed while
examining the effects of a particular independent variable on the dependent variable. It
explicitly allows the independent variables to be correlated.
2. Although the model is linear in its parameters, it can be used to model nonlinear relation-
ships by appropriately choosing the dependent and independent variables.
3. The method of ordinary least squares is easily applied to estimate the multiple regression
model. Each slope estimate measures the partial effect of the corresponding independent
variable on the dependent variable, holding all other independent variables fixed.
4. R2 is the proportion of the sample variation in the dependent variable explained by the
independent variables, and it serves as a goodness-of-fit measure. It is important not to put
too much weight on the value of R2 when evaluating econometric models.
5. Under the first four Gauss-Markov assumptions (MLR.1 through MLR.4), the OLS esti-
mators are unbiased. This implies that including an irrelevant variable in a model has no
effect on the unbiasedness of the intercept and other slope estimators. On the other hand,
omitting a relevant variable causes OLS to be biased. In many circumstances, the direction
of the bias can be determined.
6. Under the five Gauss-Markov assumptions, the variance of an OLS slope estimator is given
by Var(\hat{b}_j) = \sigma^2/[SST_j(1 - R_j^2)]. As the error variance \sigma^2 increases, so does Var(\hat{b}_j), while
Var(\hat{b}_j) decreases as the sample variation in x_j, SST_j, increases. The term R_j^2 measures the
amount of collinearity between x_j and the other explanatory variables. As R_j^2 approaches
one, Var(\hat{b}_j) is unbounded.
7. Adding an irrelevant variable to an equation generally increases the variances of the
remaining OLS estimators because of multicollinearity.
8. Under the Gauss-Markov assumptions (MLR.1 through MLR.5), the OLS estimators are
the best linear unbiased estimators (BLUEs).
Key Terms
Best Linear Unbiased Estimator (BLUE)
Biased Toward Zero
Ceteris Paribus
Degrees of Freedom (df)
Disturbance
Downward Bias
Endogenous Explanatory Variable
Error Term
Excluding a Relevant Variable
Exogenous Explanatory Variable
Explained Sum of Squares (SSE)
First Order Conditions
Gauss-Markov Assumptions
Gauss-Markov Theorem
Inclusion of an Irrelevant Variable
Intercept
Micronumerosity
Misspecification Analysis
Multicollinearity
Multiple Linear Regression Model
Multiple Regression Analysis
OLS Intercept Estimate
OLS Regression Line
OLS Slope Estimate
Omitted Variable Bias
Ordinary Least Squares
Overspecifying the Model
Partial Effect
Perfect Collinearity
Population Model
Residual
Residual Sum of Squares
Sample Regression Function (SRF)
Slope Parameter
Standard Deviation of \hat{b}_j
Standard Error of \hat{b}_j
Standard Error of the Regression (SER)
Sum of Squared Residuals (SSR)
Total Sum of Squares (SST)
True Model
Underspecifying the Model
Upward Bias
Variance Inflation Factor (VIF)
Problems
1 Using the data in GPA2.RAW on 4,137 college students, the following equation was esti-
mated by OLS:
colgpa = 1.392 - .0135 hsperc + .00148 sat
n = 4,137, R^2 = .273,
where colgpa is measured on a four-point scale, hsperc is the percentile in the high school
graduating class (defined so that, for example, hsperc = 5 means the top 5% of the class),
and sat is the combined math and verbal scores on the student achievement test.
(i) Why does it make sense for the coefficient on hsperc to be negative?
(ii) What is the predicted college GPA when hsperc = 20 and sat = 1,050?
(iii) Suppose that two high school graduates, A and B, graduated in the same percentile
from high school, but Student A’s SAT score was 140 points higher (about one stan-
dard deviation in the sample). What is the predicted difference in college GPA for
these two students? Is the difference large?
(iv) Holding hsperc fixed, what difference in SAT scores leads to a predicted colgpa dif-
ference of .50, or one-half of a grade point? Comment on your answer.
2 The data in WAGE2.RAW on working men was used to estimate the following equation:
educ = 10.36 - .094 sibs + .131 meduc + .210 feduc
n = 722, R^2 = .214,
where educ is years of schooling, sibs is number of siblings, meduc is mother’s years of
schooling, and feduc is father’s years of schooling.
(i) Does sibs have the expected effect? Explain. Holding meduc and feduc fixed, by how
much does sibs have to increase to reduce predicted years of education by one year?
(A noninteger answer is acceptable here.)
(ii) Discuss the interpretation of the coefficient on meduc.
(iii) Suppose that Man A has no siblings, and his mother and father each have 12 years of
education. Man B has no siblings, and his mother and father each have 16 years of
education. What is the predicted difference in years of education between B and A?
3 The following model is a simplified version of the multiple regression model used by Bid-
dle and Hamermesh (1990) to study the tradeoff between time spent sleeping and working
and to look at other factors affecting sleep:
sleep = b_0 + b_1 totwrk + b_2 educ + b_3 age + u,
where sleep and totwrk (total work) are measured in minutes per week and educ and age
are measured in years. (See also Computer Exercise C3 in Chapter 2.)
(i) If adults trade off sleep for work, what is the sign of b1?
(ii) What signs do you think b2 and b3 will have?
(iii) The estimated equation is

\widehat{sleep} = 3,638.25 - .148 totwrk - 11.13 educ + 2.20 age
n = 706, R^2 = .113.

If someone works five more hours per week, by how many minutes is sleep predicted
to fall? Is this a large tradeoff?
(iv) Discuss the sign and magnitude of the estimated coefficient on educ.
(v) Would you say totwrk, educ, and age explain much of the variation in sleep? What
other factors might affect the time spent sleeping? Are these likely to be correlated
with totwrk?
4 The median starting salary for new law school graduates is determined by

log(salary) = b_0 + b_1 LSAT + b_2 GPA + b_3 log(libvol) + b_4 log(cost) + b_5 rank + u,

where LSAT is the median LSAT score for the graduating class, GPA is the median college
GPA for the class, libvol is the number of volumes in the law school library, cost is the an-
nual cost of attending law school, and rank is a law school ranking (with rank = 1 being
the best).
(i) Explain why we expect b_5 ≤ 0.
(ii) What signs do you expect for the other slope parameters? Justify your answers.
(iii) Using the data in LAWSCH85.RAW, the estimated equation is
log(salary) = 8.34 + .0047 LSAT + .248 GPA + .095 log(libvol)
+ .038 log(cost) - .0033 rank
n = 136, R^2 = .842.
What is the predicted ceteris paribus difference in salary for schools with a median
GPA different by one point? (Report your answer as a percentage.)
(iv) Interpret the coefficient on the variable log(libvol).
(v) Would you say it is better to attend a higher ranked law school? How much is a
difference in ranking of 20 worth in terms of predicted starting salary?
5 In a study relating college grade point average to time spent in various activities, you dis-
tribute a survey to several students. The students are asked how many hours they spend
each week in four activities: studying, sleeping, working, and leisure. Any activity is put
into one of the four categories, so that for each student, the sum of hours in the four activi-
ties must be 168.
(i) In the model

GPA = b_0 + b_1 study + b_2 sleep + b_3 work + b_4 leisure + u,

does it make sense to hold sleep, work, and leisure fixed, while changing study?
(ii) Explain why this model violates Assumption MLR.3.
(iii) How could you reformulate the model so that its parameters have a useful interpreta-
tion and it satisfies Assumption MLR.3?
6 Consider the multiple regression model containing three independent variables, under
Assumptions MLR.1 through MLR.4:

y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + u.

You are interested in estimating the sum of the parameters on x_1 and x_2; call this
\theta_1 = b_1 + b_2.
(i) Show that \hat{\theta}_1 = \hat{b}_1 + \hat{b}_2 is an unbiased estimator of \theta_1.
(ii) Find Var(\hat{\theta}_1) in terms of Var(\hat{b}_1), Var(\hat{b}_2), and Corr(\hat{b}_1, \hat{b}_2).
7 Which of the following can cause OLS estimators to be biased?
(i) Heteroskedasticity.
(ii) Omitting an important variable.
(iii) A sample correlation coefficient of .95 between two independent variables both in-
cluded in the model.
8 Suppose that average worker productivity at manufacturing firms (avgprod ) depends on
two factors, average hours of training (avgtrain) and average worker ability (avgabil):

avgprod = b_0 + b_1 avgtrain + b_2 avgabil + u.

Assume that this equation satisfies the Gauss-Markov assumptions. If grants have been
given to firms whose workers have less than average ability, so that avgtrain and avgabil
are negatively correlated, what is the likely bias in \tilde{b}_1 obtained from the simple regression of
avgprod on avgtrain?
9 The following equation describes the median housing price in a community in terms of
amount of pollution (nox for nitrous oxide) and the average number of rooms in houses in
the community (rooms):

log(price) = b_0 + b_1 log(nox) + b_2 rooms + u.

(i) What are the probable signs of b_1 and b_2? What is the interpretation of b_1? Explain.
(ii) Why might nox [or more precisely, log(nox)] and rooms be negatively correlated? If
this is the case, does the simple regression of log(price) on log(nox) produce an up-
ward or a downward biased estimator of b1?
(iii) Using the data in HPRICE2.RAW, the following equations were estimated:
Is the relationship between the simple and multiple regression estimates of the
elasticity of price with respect to nox what you would have predicted, given your answer in part (ii)? Does this mean that -.718 is definitely closer to the true elasticity
than -1.043?
10 Suppose that you are interested in estimating the ceteris paribus relationship between y and
x1. For this purpose, you can collect data on two control variables, x2 and x3. (For concrete-
ness, you might think of y as final exam score, x1 as class attendance, x2 as GPA up through
and this model satisfies Assumptions MLR.1 through MLR.4. However, we estimate the
model that omits x_3. Let \tilde{b}_0, \tilde{b}_1, and \tilde{b}_2 be the OLS estimators from the regression of y on x_1
and x_2. Show that the expected value of \tilde{b}_1 (given the values of the independent variables
in the sample) is

E(\tilde{b}_1) = b_1 + b_3 \frac{\sum_{i=1}^{n} \hat{r}_{i1} x_{i3}}{\sum_{i=1}^{n} \hat{r}_{i1}^2},

where the \hat{r}_{i1} are the OLS residuals from the regression of x_1 on x_2. [Hint: The formula for
\tilde{b}_1 comes from equation (3.22). Plug y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + b_3 x_{i3} + u_i into this equation. After some algebra, take the expectation treating x_{i3} and \hat{r}_{i1} as nonrandom.]
12 The following equation represents the effects of tax revenue mix on subsequent employ-
ment growth for the population of counties in the United States:

growth = b_0 + b_1 shareP + b_2 shareI + b_3 shareS + other factors,

where growth is the percentage change in employment from 1980 to 1990, shareP is the
share of property taxes in total tax revenue, shareI is the share of income tax revenues, and
shareS is the share of sales tax revenues. All of these variables are measured in 1980. The
omitted share, shareF, includes fees and miscellaneous taxes. By definition, the four shares
add up to one. Other factors would include expenditures on education, infrastructure, and
so on (all measured in 1980).
(i) Why must we omit one of the tax share variables from the equation?
(ii) Give a careful interpretation of b1.
13 (i) Consider the simple regression model y = b_0 + b_1 x + u under the first four Gauss-
Markov assumptions. For some function g(x), for example g(x) = x^2 or g(x) =
log(1 + x^2), define z_i = g(x_i). Define a slope estimator as

\tilde{b}_1 = \left( \sum_{i=1}^{n} (z_i - \bar{z}) y_i \right) \Big/ \left( \sum_{i=1}^{n} (z_i - \bar{z}) x_i \right).

Show that \tilde{b}_1 is linear and unbiased. Remember, because E(u|x) = 0, you can treat
both x_i and z_i as nonrandom in your derivation.
(iii) Show directly that, under the Gauss-Markov assumptions, Var(\hat{b}_1) ≤ Var(\tilde{b}_1),
where \hat{b}_1 is the OLS estimator. [Hint: The Cauchy-Schwartz inequality in Appendix
B implies that

\left( n^{-1} \sum_{i=1}^{n} (z_i - \bar{z})(x_i - \bar{x}) \right)^2 \le \left( n^{-1} \sum_{i=1}^{n} (z_i - \bar{z})^2 \right) \left( n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right).]
Computer Exercises
C1 A problem of interest to health officials (and others) is to determine the effects of smok-
ing during pregnancy on infant health. One measure of infant health is birth weight; a
birth weight that is too low can put an infant at risk for contracting various illnesses.
Since factors other than cigarette smoking that affect birth weight are likely to be cor-
related with smoking, we should take those factors into account. For example, higher
income generally results in access to better prenatal care, as well as better nutrition for
the mother. An equation that recognizes this is
bwght = b_0 + b_1 cigs + b_2 faminc + u.
(i) What is the most likely sign for b2?
(ii) Do you think cigs and faminc are likely to be correlated? Explain why the correla-
tion might be positive or negative.
(iii) Now, estimate the equation with and without faminc, using the data in BWGHT
.RAW. Report the results in equation form, including the sample size and
R-squared. Discuss your results, focusing on whether adding faminc substantially
changes the estimated effect of cigs on bwght.
C2 Use the data in HPRICE1.RAW to estimate the model
price = b_0 + b_1 sqrft + b_2 bdrms + u,
where price is the house price measured in thousands of dollars.
(i) Write out the results in equation form.
(ii) What is the estimated increase in price for a house with one more bedroom, hold-
ing square footage constant?
(iii) What is the estimated increase in price for a house with an additional bedroom that
is 140 square feet in size? Compare this to your answer in part (ii).
(iv) What percentage of the variation in price is explained by square footage and num-
ber of bedrooms?
(v) The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted
selling price for this house from the OLS regression line.
(vi) The actual selling price of the first house in the sample was $300,000 (so price =
300). Find the residual for this house. Does it suggest that the buyer underpaid or
overpaid for the house?
C3 The file CEOSAL2.RAW contains data on 177 chief executive officers and can be used
to examine the effects of firm performance on CEO salary.
(i) Estimate a model relating annual salary to firm sales and market value. Make the
model of the constant elasticity variety for both independent variables. Write the
results out in equation form.
(ii) Add profits to the model from part (i). Why can this variable not be included in
logarithmic form? Would you say that these firm performance variables explain
most of the variation in CEO salaries?
(iii) Add the variable ceoten to the model in part (ii). What is the estimated percentage
return for another year of CEO tenure, holding other factors fixed?
(iv) Find the sample correlation coefficient between the variables log(mktval) and
profits. Are these variables highly correlated? What does this say about the OLS
estimators?
C4 Use the data in ATTEND.RAW for this exercise.
(i) Obtain the minimum, maximum, and average values for the variables atndrte,
priGPA, and ACT.
(ii) Estimate the model
atndrte = b_0 + b_1 priGPA + b_2 ACT + u,
and write the results in equation form. Interpret the intercept. Does it have a useful
meaning?
(iii) Discuss the estimated slope coefficients. Are there any surprises?
(iv) What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you
make of this result? Are there any students in the sample with these values of the
explanatory variables?
(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1
and ACT = 26, what is the predicted difference in their attendance rates?
C5 Confirm the partialling out interpretation of the OLS estimates by explicitly doing the
partialling out for Example 3.2. This first requires regressing educ on exper and tenure
and saving the residuals, \hat{r}_1. Then, regress log(wage) on \hat{r}_1. Compare the coefficient on
\hat{r}_1 with the coefficient on educ in the regression of log(wage) on educ, exper, and
tenure.
C6 Use the data set in WAGE2.RAW for this problem. As usual, be sure all of the following regressions contain an intercept.
(i) Run a simple regression of IQ on educ to obtain the slope coefficient, say, \tilde{\delta}_1.
(ii) Run the simple regression of log(wage) on educ, and obtain the slope
coefficient, \tilde{b}_1.
(iii) Run the multiple regression of log(wage) on educ and IQ, and obtain the
slope coefficients, \hat{b}_1 and \hat{b}_2, respectively.
(iv) Verify that \tilde{b}_1 = \hat{b}_1 + \hat{b}_2 \tilde{\delta}_1.
C7 Use the data in MEAP93.RAW to answer this question.
(i) Estimate the model
math10 = b_0 + b_1 log(expend) + b_2 lnchprg + u,
and report the results in the usual form, including the sample size and R-squared.
Are the signs of the slope coefficients what you expected? Explain.
(ii) What do you make of the intercept you estimated in part (i)? In particular, does
it make sense to set the two explanatory variables to zero? [Hint: Recall that
log(1) = 0.]
(iii) Now run the simple regression of math10 on log(expend), and compare the slope
coefficient with the estimate obtained in part (i). Is the estimated spending effect
now larger or smaller than in part (i)?
(iv) Find the correlation between lexpend = log(expend) and lnchprg. Does its sign
make sense to you?
(v) Use part (iv) to explain your findings in part (iii).
C8 Use the data in DISCRIM.RAW to answer this question. These are zip code–level data
on prices for various items at fast-food restaurants, along with characteristics of the zip
code population, in New Jersey and Pennsylvania. The idea is to see whether fast-food
restaurants charge higher prices in areas with a larger concentration of blacks.
(i) Find the average values of prpblck and income in the sample, along with their
standard deviations. What are the units of measurement of prpblck and income?
(ii) Consider a model to explain the price of soda, psoda, in terms of the proportion of
the population that is black and median income:

psoda = b_0 + b_1 prpblck + b_2 income + u.

Estimate this model by OLS and report the results in equation form, including the
sample size and R-squared. (Do not use scientific notation when reporting the esti-
mates.) Interpret the coefficient on prpblck. Do you think it is economically large?
(iii) Compare the estimate from part (ii) with the simple regression estimate from
psoda on prpblck. Is the discrimination effect larger or smaller when you control
for income?
(iv) A model with a constant price elasticity with respect to income may be more
appropriate. Report estimates of the model

log(psoda) = b_0 + b_1 prpblck + b_2 log(income) + u.
If prpblck increases by .20 (20 percentage points), what is the estimated percent-
age change in psoda? (Hint: The answer is 2.xx, where you fill in the “xx.”)
(v) Now add the variable prppov to the regression in part (iv). What happens
to \hat{b}_{prpblck}?
(vi) Find the correlation between log(income) and prppov. Is it roughly what you
expected?
(vii) Evaluate the following statement: “Because log(income) and prppov are so highly
correlated, they have no business being in the same regression.”
C9 Use the data in CHARITY.RAW to answer the following questions:
(i) Estimate the equation

gift = b_0 + b_1 mailsyear + b_2 giftlast + b_3 propresp + u

by OLS and report the results in the usual way, including the sample size and
R-squared. How does the R-squared compare with that from the simple regression
that omits giftlast and propresp?
(ii) Interpret the coefficient on mailsyear. Is it bigger or smaller than the correspond-
ing simple regression coefficient?
(iii) Interpret the coefficient on propresp. Be careful to notice the units of measure-
ment of propresp.
(iv) Now add the variable avggift to the equation. What happens to the estimated effect
of mailsyear?
(v) In the equation from part (iv), what has happened to the coefficient on giftlast?
What do you think is happening?
C10 Use the data in HTV.RAW to answer this question. The data set includes information on
wages, education, parents’ education, and several other variables for 1,230 working men
in 1991.
(i) What is the range of the educ variable in the sample? What percentage of men
completed 12th grade but no higher grade? Do the men or their parents have, on
average, higher levels of education?
(ii) Estimate the regression model
educ = b_0 + b_1 motheduc + b_2 fatheduc + u
by OLS and report the results in the usual form. How much sample variation in
educ is explained by parents’ education? Interpret the coefficient on motheduc.
(iii) Add the variable abil (a measure of cognitive ability) to the regression from
part (ii), and report the results in equation form. Does “ability” help to explain
variations in education, even after controlling for parents’ education? Explain.
(iv) (Requires calculus) Now estimate an equation where abil appears in quadratic form:
educ = b_0 + b_1 motheduc + b_2 fatheduc + b_3 abil + b_4 abil^2 + u.
Using the estimates \hat{b}_3 and \hat{b}_4, use calculus to find the value of abil, call it abil*,
where educ is minimized. (The other coefficients and values of parents’ education
variables have no effect; we are holding parents’ education fixed.) Notice that abil
is measured so that negative values are permissible. You might also verify that the
second derivative is positive so that you do indeed have a minimum.
(v) Argue that only a small fraction of men in the sample have “ability” less than the
value calculated in part (iv). Why is this important?
(vi) If you have access to a statistical program that includes graphing capabilities,
use the estimates in part (iv) to graph the relationship between the predicted educa-
tion and abil. Let motheduc and fatheduc have their average values in the sample,
12.18 and 12.45, respectively.
Appendix 3A
Taking the partial derivatives with respect to each of the b_j (see Appendix A), evaluating
them at the solutions, and setting them equal to zero gives the first order conditions

\sum_{i=1}^{n} (y_i - \hat{b}_0 - \hat{b}_1 x_{i1} - … - \hat{b}_k x_{ik}) = 0

\sum_{i=1}^{n} x_{ij}(y_i - \hat{b}_0 - \hat{b}_1 x_{i1} - … - \hat{b}_k x_{ik}) = 0,  j = 1, …, k.
By the definition of the OLS residual \hat{u}_i, since \hat{x}_{i1} is just a linear function of the explanatory variables x_{i2}, …, x_{ik}, it follows that \sum_{i=1}^{n} \hat{x}_{i1}\hat{u}_i = 0. Therefore, equation (3.63) can be
expressed as

\sum_{i=1}^{n} \hat{r}_{i1}(y_i - \hat{b}_0 - \hat{b}_1 x_{i1} - … - \hat{b}_k x_{ik}) = 0. [3.64]
Since the \hat{r}_{i1} are the residuals from regressing x_1 on x_2, …, x_k, we have \sum_{i=1}^{n} x_{ij}\hat{r}_{i1} = 0 for all
j = 2, …, k. Therefore, (3.64) is equivalent to \sum_{i=1}^{n} \hat{r}_{i1}(y_i - \hat{b}_1 x_{i1}) = 0. Finally, we use the
fact that \sum_{i=1}^{n} \hat{x}_{i1}\hat{r}_{i1} = 0, which means that \hat{b}_1 solves

\sum_{i=1}^{n} \hat{r}_{i1}(y_i - \hat{b}_1 \hat{r}_{i1}) = 0.
Now, straightforward algebra gives (3.22), provided, of course, that \sum_{i=1}^{n} \hat{r}_{i1}^2 > 0; this is
ensured by Assumption MLR.3.
Substituting y_i = b_0 + b_1 x_{i1} + … + b_k x_{ik} + u_i into (3.22) and using the properties of the \hat{r}_{i1} gives

\hat{b}_1 = b_1 + \left( \sum_{i=1}^{n} \hat{r}_{i1} u_i \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right). [3.65]
Now, under Assumptions MLR.2 and MLR.4, the expected value of each u_i, given all
independent variables in the sample, is zero. Since the \hat{r}_{i1} are just functions of the sample
independent variables, it follows that

E(\hat{b}_1 | X) = b_1 + \left( \sum_{i=1}^{n} \hat{r}_{i1} E(u_i | X) \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right)
            = b_1 + \left( \sum_{i=1}^{n} \hat{r}_{i1} \cdot 0 \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right) = b_1,

where X denotes the data on all independent variables and E(\hat{b}_1 | X) is the expected value
of \hat{b}_1, given x_{i1}, …, x_{ik}, for all i = 1, …, n. This completes the proof.
E(\tilde{b}_j | X) = E(\hat{b}_j | X) + E(\hat{b}_k | X)\tilde{\delta}_j [3.67]
           = b_j + b_k \tilde{\delta}_j.

Equation (3.67) shows that \tilde{b}_j is biased for b_j unless b_k = 0—in which case x_k has no
partial effect in the population—or \tilde{\delta}_j equals zero, which means that x_{ik} and x_{ij} are partially uncorrelated in the sample. The key to obtaining equation (3.67) is equation (3.66).
To show equation (3.66), we can use equation (3.22) a couple of times. For simplicity,
we look at j = 1. Now, \tilde{b}_1 is the slope coefficient in the simple regression of y_i on \tilde{r}_{i1},
i = 1, …, n, where the \tilde{r}_{i1} are the OLS residuals from the regression of x_{i1} on x_{i2}, x_{i3}, …,
x_{i,k-1}. Consider the numerator of the expression for \tilde{b}_1: \sum_{i=1}^{n} \tilde{r}_{i1} y_i. But for each i, we
can write y_i = \hat{b}_0 + \hat{b}_1 x_{i1} + … + \hat{b}_k x_{ik} + \hat{u}_i and plug in for y_i. Now, by properties of the
OLS residuals, the \tilde{r}_{i1} have zero sample average and are uncorrelated with x_{i2}, x_{i3}, …,
x_{i,k-1} in the sample. Similarly, the \hat{u}_i have zero sample average and zero sample correlation with x_{i1}, x_{i2}, …, x_{ik}. It follows that the \tilde{r}_{i1} and \hat{u}_i are uncorrelated in the sample (since
the \tilde{r}_{i1} are just linear combinations of x_{i1}, x_{i2}, …, x_{i,k-1}). So

\sum_{i=1}^{n} \tilde{r}_{i1} y_i = \hat{b}_1 \sum_{i=1}^{n} \tilde{r}_{i1} x_{i1} + \hat{b}_k \sum_{i=1}^{n} \tilde{r}_{i1} x_{ik}.

Dividing by \sum_{i=1}^{n} \tilde{r}_{i1}^2 and using the fact that \sum_{i=1}^{n} \tilde{r}_{i1} x_{i1} = \sum_{i=1}^{n} \tilde{r}_{i1}^2 gives

\tilde{b}_1 = \hat{b}_1 + \hat{b}_k \left( \sum_{i=1}^{n} \tilde{r}_{i1} x_{ik} \right) \Big/ \left( \sum_{i=1}^{n} \tilde{r}_{i1}^2 \right) = \hat{b}_1 + \hat{b}_k \tilde{\delta}_1.

This is the relationship we wanted to show.
From (3.65), conditioning on X and using Assumption MLR.5,

Var(\hat{b}_1 | X) = \sigma^2 \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right)^2 = \sigma^2 \Big/ \sum_{i=1}^{n} \hat{r}_{i1}^2.

Now, since \sum_{i=1}^{n} \hat{r}_{i1}^2 is the sum of squared residuals from regressing x_1 on x_2, …, x_k, we have
\sum_{i=1}^{n} \hat{r}_{i1}^2 = SST_1(1 - R_1^2). This completes the proof.
because E(u_i | X) = 0, for all i = 1, …, n, under MLR.2 and MLR.4. Therefore, for E(\tilde{b}_1 | X)
to equal b_1 for any values of the parameters, we must have

\sum_{i=1}^{n} w_{i1} = 0,   \sum_{i=1}^{n} w_{i1} x_{i1} = 1,   \sum_{i=1}^{n} w_{i1} x_{ij} = 0, j = 2, …, k. [3.69]

Now, let \hat{r}_{i1} be the residuals from the regression of x_{i1} on x_{i2}, …, x_{ik}. Then, from (3.69), it
follows that

\sum_{i=1}^{n} w_{i1} \hat{r}_{i1} = 1 [3.70]
because x_{i1} = \hat{x}_{i1} + \hat{r}_{i1} and \sum_{i=1}^{n} w_{i1}\hat{x}_{i1} = 0. Now, consider the difference between
Var(\tilde{b}_1 | X) and Var(\hat{b}_1 | X) under MLR.1 through MLR.5:

\sigma^2 \sum_{i=1}^{n} w_{i1}^2 - \sigma^2 \Big/ \sum_{i=1}^{n} \hat{r}_{i1}^2. [3.71]
Because of (3.70), \sigma^2 \big/ \sum_{i=1}^{n} \hat{r}_{i1}^2 = \sigma^2 \left( \sum_{i=1}^{n} w_{i1}\hat{r}_{i1} \right)^2 \big/ \sum_{i=1}^{n} \hat{r}_{i1}^2, so the difference in (3.71) equals \sigma^2 times

\sum_{i=1}^{n} w_{i1}^2 - \left( \sum_{i=1}^{n} w_{i1}\hat{r}_{i1} \right)^2 \Big/ \sum_{i=1}^{n} \hat{r}_{i1}^2, [3.72]

which in turn equals

\sum_{i=1}^{n} (w_{i1} - \hat{g}_1 \hat{r}_{i1})^2, [3.73]

where \hat{g}_1 = \sum_{i=1}^{n} w_{i1}\hat{r}_{i1} \Big/ \sum_{i=1}^{n} \hat{r}_{i1}^2, as can be seen by squaring each term in (3.73),
summing, and then canceling terms. Because (3.73) is just the sum of squared residuals from the simple regression of w_{i1} on \hat{r}_{i1}—remember that the sample average of \hat{r}_{i1} is
zero—(3.73) must be nonnegative. This completes the proof.