Chapter 2
Contents
1 Univariate Characterizations
1.1 Statistical Summary
1.2 Histograms
1.3 Density Estimation
1.4 Quantile Plots
1.5 Boxplots
2 Bivariate Characterizations
2.1 Scatterplots
2.2 Jittering Scatterplots
1 Univariate Characterizations
In a regression setting we start by looking at a univariate characterization of the data. This entails
examining the statistical summaries, features, and distributions of all the variables we have.
Among the popular univariate graphical tools, we will consider histograms, density plots, Q-Q plots, and
boxplots.
1.1 Statistical Summary
Looking at the summary statistics of our data allows us to identify potential issues we may encounter when
building our regression model. For example, there may be NAs, values outside the expected range, etc. Below
are two examples using the pastecs and psych libraries.
library(pastecs)
# Summary for 1 variable:
stat.desc(rnorm(100),norm=TRUE)
library(psych)
library(car)      # provides the Prestige data set used below
describe(Prestige)
1.2 Histograms
Histograms should be one of our first analysis tools since they can help us learn about each variable’s properties.
Things to look for:
• Outliers: These can influence the regression fit in many ways
• Shape: Is the distribution symmetric, skewed, etc?
• Range: Is the range of values very large, e.g., spanning several orders of magnitude?
• Density of observations and/or clustering features: Are the data all clustered in certain regions? In high-density
regions your regression fit may be robust, but it can become unstable in sparser ones.
A near-optimal number of bins k, based on the number of observations n, is given by Sturges’ rule:

k = 1 + log2(n)

However, more sample-specific rules exist, such as the Freedman–Diaconis (FD) rule, which suggests

k = n^(1/3) (max − min) / (2 (Q3 − Q1))

bins, where Q1 and Q3 are the first and third quartiles.
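As a quick numerical check (a minimal sketch, not part of the original notes), we can compute the FD bin count directly from the formula and compare it with base R’s built-in helper:

# FD bin count computed by hand vs. R's built-in helper nclass.FD()
x <- rnorm(1000)
k_fd <- ceiling((max(x) - min(x)) * length(x)^(1/3) / (2 * IQR(x)))
k_fd
nclass.FD(x)   # should be very close to k_fd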
For example, we can compare the two methods for 2 different data sets:
#Compare histograms using k vs FD:
#quartz()
par(mfrow=c(2,2))
n = 1000;k = 1 + log2(n)
hist(rnorm(n),breaks = k,col="skyblue4", main = "Normal Distr. Using k ")
hist(rnorm(n),breaks = "FD",col="skyblue4", main = "Normal Distr. Using FD")
n = 1000;k = 1 + log2(n)
hist(rexp(n,0.5),breaks = k,col="skyblue2", main ="Exp Distr. Using k")
hist(rexp(n,0.5),breaks = "FD",col="skyblue2", main ="Exp Distr. Using FD")
(Figure: histograms of the normal and exponential samples, using k = 1 + log2(n) versus the FD rule for the number of bins.)
1.3 Density Estimation
Histograms are very informative, but visualizing the density estimate along with the histogram and a
1D scatterplot (rug plot) can be even more informative. A commonly used (nonparametric) kernel density
estimate is given by:
p̂(x) = (1/(nh)) ∑_{i=1}^{n} K((x − xi)/h)
where K is a kernel function (e.g., the standard normal density), h is the bandwidth (amount of smoothing), and n is the
number of observations.
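To make the formula concrete, here is a small sketch (not part of the original notes) that evaluates the estimate at a single point by hand, using a Gaussian kernel and the same default bandwidth as density():

set.seed(1)
x <- rnorm(50)
h <- bw.nrd0(x)                           # default bandwidth rule used by density()
x0 <- 0                                   # point at which to evaluate the estimate
p_hat <- mean(dnorm((x0 - x) / h)) / h    # (1/(n*h)) * sum K((x0 - x_i)/h)
p_hat
approx(density(x, bw = h), xout = x0)$y   # density() interpolated at x0, for comparison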
#quartz()
n = 1000
x = rexp(n, 0.5)   # draw the sample once so the histogram, densities, and rug all use the same data
hist(x, breaks = "FD", col = "skyblue2", freq = FALSE, ylab = "Density")
lines(density(x), lwd = 2, col = "red")
lines(density(x, adjust = 0.5), lwd = 2, col = "blue", lty = 2)
rug(x)
(Figure: FD histogram of the exponential sample with two kernel density estimates (default and adjust = 0.5 bandwidths) and a rug plot.)
# You can also shade the histogram by count with ggplot2
library(ggplot2)
df = data.frame(value = rnorm(10000))
ggplot(df, aes(x = value)) + geom_histogram(binwidth = 0.2, aes(fill = ..count..))
(Figure: ggplot2 histogram of the simulated normal values, with bars shaded by count.)
1.4 Quantile Plots
Quantile–quantile (Q–Q) plots are very useful for comparing the distribution of a variable to a theoretical reference distribution.
By default the normal distribution is used in the function qqPlot from the car package, but we can provide
any other distribution.
#The dataset Prestige from the car package has 102 observations
library(car)
attach(Prestige)
qqPlot(~ income, data=Prestige, id=list(n=3))
(Figure: normal Q–Q plot of income with a pointwise confidence envelope; general.managers, physicians, and lawyers are flagged.)
## general.managers physicians lawyers
## 2 24 17
qqPlot(income ~ type, data=Prestige, layout=c(1, 3))
(Figure: normal Q–Q plots of income within each occupation type (bc, prof, wc); general.managers, physicians, firefighters, commercial.travellers, insurance.agents, and farm.workers are among the labeled points.)
qqPlot(lm(prestige ~ income + education + type, data=Duncan),
envelope=.99)
(Figure: Q–Q plot of the studentized residuals from lm(prestige ~ income + education + type, data = Duncan) against t quantiles with a 99% envelope; minister and machinist stand out.)
## minister machinist
## 6 28
1.5 Boxplots
Boxplots let us read off quantitative aspects of a variable’s distribution such as the min, max, Q1,
Q3, IQR, and median. Another helpful property is that they are very convenient when comparing many
distributions simultaneously and/or conditioning on other variables.
#Boxplot of Income
Boxplot(~income, data=Prestige)
(Figure: boxplot of income; general.managers, physicians, lawyers, osteopaths.chiropractors, and veterinarians are flagged as outliers.)
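The by-nation comparison shown next can be produced with a conditional Boxplot call along these lines (a sketch; the exact call is an assumption based on the axis labels in the figure):

library(car)
# Boxplots of interlocks conditioned on nation of control; outlying firms are labeled
Boxplot(Ornstein$interlocks ~ Ornstein$nation)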
(Figure: boxplots of Ornstein$interlocks by Ornstein$nation (CAN, OTH, UK, US), with outlying firms labeled.)
## [1] "1" "2" "3" "5" "6" "9" "13" "27"
# Note: interlocks = Number of interlocking director and executive positions
# shared with other major firms.
library("plotrix")
means <- Tapply(interlocks ~ nation, mean, data=Ornstein)
sds <- Tapply(interlocks ~ nation, sd, data=Ornstein)
plotCI(1:4, means, sds, xaxt="n", xlab="Nation of Control",
ylab="interlocks", ylim=range(Ornstein$interlocks))
lines(1:4, means)
axis(1, at=1:4, labels = names(means))
(Figure: mean number of interlocks by nation of control with ±1 standard deviation error bars.)
2 Bivariate Characterizations
Once we have the univariate description of the variables, we can examine the pairwise associations via,
e.g., scatterplots.
2.1 Scatterplots
library(car)
attach(Prestige)
scatterplot(prestige ~income, lwd=3, id=TRUE)
Note: when looking at these graphs, look for the fitted line and the shape of the point cloud. Here the relationship shows a positive trend; too much discrepancy between the line and the shape of the plot suggests that the model is not a good fit.

(Figure: scatterplot of prestige vs. income with marginal boxplots, a linear fit, and a lowess smoother; observations 2 and 24 are flagged.)
## [1] 2 24
#Note: Lowess smoother is a 'locally weighted smoother'
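The grouped plot shown next conditions the scatterplot on occupation type; a call of roughly this form would produce it (a sketch, the exact options are assumptions):

# prestige vs. income with a separate fit/smoother for each occupation type
scatterplot(prestige ~ income | type, data = Prestige, id = TRUE)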
(Figure: scatterplot of prestige vs. income grouped by type (bc, prof, wc), with separate smoothers and labeled outlying observations.)
## 31 58 51 82
## 1 2 3 19 24 25
#If we have several predictors in our model, it is more practical to
# simply look at a scatterplot matrix of the data. For example:
# scatterplotMatrix(~ prestige + income + education + women)
(Figure: scatterplot matrix of prestige, income, education, and women.)
2.2 Jittering Scatterplots

(Figure: scatterplot of Vocab$vocabulary vs. Vocab$education; both variables are discrete, so many observations overplot.)
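The overplotted scatterplot above comes from a plain plot call; a minimal sketch (assuming the Vocab data from the car/carData package) is:

library(car)   # Vocab data
plot(Vocab$vocabulary ~ Vocab$education)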
#Add some random noise to the variables --> jitter the data
plot(jitter(Vocab$vocabulary)~jitter(Vocab$education))
(Figure: jittered scatterplot of vocabulary vs. education.)
# Increase the amount of jitter
plot(jitter(Vocab$vocabulary,factor =2)~jitter(Vocab$education, factor =2),col="gray",cex=0.5)
abline(lm(Vocab$vocabulary~ Vocab$education),lwd= 3, lty="dashed")
lines(lowess(Vocab$education,Vocab$vocabulary,f=0.2), lwd =3, col="red")
(Figure: heavily jittered scatterplot of vocabulary vs. education with the least-squares line (dashed) and a lowess smoother.)
Table 1: Terminology

y                    x
Dependent Variable   Independent Variable
Explained Variable   Explanatory Variable
Response Variable    Control Variable
Predicted Variable   Predictor Variable
Regressand           Regressor
The econometric model is the motivation; for example, it could be a supply equation, and therefore it comes from economic
theory. The parameters of the model are also known as the population parameters. For example, let us
examine the relationship between weekly food expenditure and weekly household income.
• A household spends $80 plus 10 cents of each dollar of income received on food.
• Algebraically the rule is y = 80 + 0.10x where y = weekly household food expenditure ($) and x =
weekly household income ($).
• In reality, many factors may affect household expenditure on food.
• Let e (error term) = everything else affecting food other than income.
• y = β 1 + β 2 x + e. This equation is the simple regression model.
• A simple linear regression analysis examines the relationship between a y-variable and one x-variable.
R Example
library(POE5Rdata)
# Histograms
data("food")
hist(food$income,col="skyblue3", main = "PDF of Weekly Income = f(x)")
(Figure: histogram “PDF of Weekly Income = f(x)” of food$income.)
hist(food$food_exp, col="skyblue3", main = "PDF of Weekly Food Expenditure = f(y|x)")
(Figure: histogram of food$food_exp.)
hist(food$food_exp[food$income<=20],col="skyblue3", main = "PDF of f(y|x<2000)")
(Figure: histogram “PDF of f(y|x<2000)” of food expenditure for households with weekly income at or below $2000.)
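The second conditional histogram restricts attention to incomes above $2000 (income > 20, since income is measured in $100s); a sketch of such a chunk is:

hist(food$food_exp[food$income>20],col="skyblue3", main = "PDF of f(y|x>2000)")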
(Figure: histogram “PDF of f(y|x>2000)” of food expenditure for households with weekly income above $2000.)
# Residuals vs. income for the fitted model food_exp ~ income
ols.mod <- lm(food_exp~income, data=food)
plot(food$income, ols.mod$residuals, pch=20, ylab="Residuals", xlab="Weekly Income ($100s)")
abline(h=0,col="red", lwd=2)

(Figure: residuals vs. weekly income with a horizontal reference line at zero; the spread increases with income.)

# Note: From the residuals plot we can see that the variance is not constant
yi = β1 + β2 xi + ei

In R (and most other software packages) the default null hypothesis tested is H0 : βk = 0 against the
alternative H1 : βk ≠ 0, k = 1, 2. Therefore, if the respective estimated p-value is small (p-value < α), we
reject H0 and conclude that the respective parameter is statistically significantly different from 0.
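For instance, with the food expenditure model used throughout this chapter, these p-values can be read directly off the coefficient table (a minimal sketch):

library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp ~ income, data = food)
summary(ols.mod)$coefficients[, "Pr(>|t|)"]   # p-values for H0: beta_k = 0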
The Simple Linear Regression (SR) assumptions of the model are that:
1. Linearity: The value of y, for each value of x, is y = β1 + β2 x + e
2. E[ei | xi] = 0, which is the same as E[y | x] = β1 + β2 x
3. Homoskedasticity: Var[ei | xi] = Var[y | x] = σ² (constant variance condition, i.e., the errors are homoskedastic)
4. cov(yi, yj) = cov(ei, ej | x) = 0 (even better if the errors are independent, which is a stronger condition)
5. Rank Condition: x is not random and takes on at least two different values
6. Normality (optional): The errors are normally distributed, i.e., ei | xi ∼ N(0, σ²)
Note: Since the data {(x1, y1), . . . , (xn, yn)} are a random sample, the pairs (xi, yi) are independent and identically
distributed (iid).
3.1.2 The Random Error and Strict Exogeneity
• The second assumption of the simple regression model concerns the “everything else” term e.
• Unlike food expenditure and income, the random error term e is not observable.
• The x-variable, income, cannot be used to predict the value of e.
• E [ei | xi ] = 0 has two implications:
– E [ ei ] = 0
– cov (ei , xi ) = 0
To better understand the error term we can think of decomposing the observation y into 2 components:
1. E[y| x ] = β 1 + β 2 x, which represents the systematic component of y, and
2. e = y − E[y| x ] = y − β 1 − β 2 x, which represents the random component and thus, encapsulates
everything else that contributes to the variations in y. Therefore,
y = β1 + β2 x + e

where y is the data, β1 + β2 x is the model, and e is the uncertainty.
R Example
library(POE5Rdata)
data("food")
# Test the covariance assumption
ols.mod <- lm(food_exp~income, data=food)
cov(food$income, ols.mod$residuals) #Seems close enough to 0
## [1] -1.166019e-14
# Test E[e] = 0
mean(ols.mod$residuals) #Seems close enough to 0
## [1] -4.973799e-15
• The fitted regression line is

ŷi = b1 + b2 xi

and the residual is

êi = yi − ŷi = yi − b1 − b2 xi

• The least squares estimates satisfy the normal equations

0 = ∑i (yi − b1 − b2 xi) = ∑i êi
0 = ∑i xi (yi − b1 − b2 xi) = ∑i xi êi

and solving these gives

b2 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²
b1 = ȳ − b2 x̄

the ordinary least squares estimators. “Ordinary least squares” is abbreviated as OLS.
• It can be shown that

b2 = ∑_{i=1}^{N} wi yi,  where  wi = (xi − x̄) / ∑(xi − x̄)²
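These formulas are easy to verify numerically with the food data (a quick sketch); the results should match the lm() coefficients reported later in this chapter (about 83.42 and 10.21):

library(POE5Rdata)
data("food")
x <- food$income; y <- food$food_exp
b2 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1 <- mean(y) - b2 * mean(x)
w  <- (x - mean(x)) / sum((x - mean(x))^2)      # the weights w_i
c(b1 = b1, b2 = b2, b2.from.weights = sum(w * y))
coef(lm(food_exp ~ income, data = food))        # should agree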
E[b2] = β2

i.e., b2 is unbiased.

• It can further be shown that

Var(b2) = σ² / ∑(xi − x̄)²
Var(b1) = σ² ∑xi² / (N ∑(xi − x̄)²)
cov(b1, b2) = −σ² x̄ / (N ∑(xi − x̄)²)
• Note that
1. The larger the variance term, the greater the uncertainty there is in the statistical model, and the
larger the variances and covariance of the least squares estimators.
2. The larger the sum of squares ∑ ( xi − x )2 , the smaller the variances of the least squares estimators
and the more precisely we can estimate the unknown parameters.
3. The larger the sample size N, the smaller the variances and covariance of the least squares
estimators.
If the homoskedasticity assumption is violated, so that Var[ei | xi] depends on xi, then the random errors are said to be
heteroskedastic.
R Example
library(POE5Rdata)
data("food")
#A plot of the estimate of the errors, i.e., the residuals, shows that they are heteroskedastic.
ols.mod <- lm(food_exp~income, data=food)
plot(food$income, ols.mod$residuals, pch=20, ylab="Residuals", xlab="Weekly Income ($100s)")
abline(h=0,col="red", lwd=2)
(Figure: residuals vs. weekly income with a horizontal reference line at zero.)
# Note: From the residuals plot we can see that the variance is not constant
3.1.5 Variation in x
3.1.6 Error Normality
• It is not at all necessary for the random errors to be conditionally normal in order for regression analysis
to “work”.
• When samples are small, it is advantageous for statistical inferences that the random errors, and
dependent variable y, given each x-value, are normally distributed.
• Central Limit Theorem says roughly that collections of many random factors tend toward having a
normal distribution.
• It is possible that there are correlations between the random error terms.
• With cross-sectional data, data collected at one point in time, there may be a lack of statistical indepen-
dence between random errors for individuals who are spatially connected.
• Within a larger sample of data, there may be clusters of observations with correlated errors because of
the spatial component.
• The starting point in regression analysis is to assume that there is no error correlation.
• We can use the sample data values of yi and xi from the food data set, to estimate the unknown
regression parameters β 1 and β 2 .
• These parameters represent the unknown intercept and slope coefficients for the food expenditure–
income relationship.
• If we represent the 40 data points as (yi , xi ), i = 1, . . . , 40, and plot them, we obtain the scatter diagram
below.
• Our problem is to estimate the location of the mean expenditure line.
• We would expect this line to be somewhere in the middle of all the data points.
R Example
library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp~income, data=food)
plot(food$income,food$food_exp, pch=20,
xlab="Weekly Income ($100s)", ylab="Weekly Food Expenditure ($)")
abline(ols.mod,lwd=2,col="blue")
legend(6, 550, legend="Linear Relationship Model: Mean Expenditure Line",col=c("blue"),lty=1, cex=0.8)
(Figure: weekly food expenditure vs. weekly income with the fitted least-squares line.)
R Example
library(POE5Rdata)
data("food")
# Fit the simple regression of food expenditure on income and summarize it
ols.mod <- lm(food_exp ~ income, data = food)
summary(ols.mod)
##
## Call:
## lm(formula = food_exp ~ income, data = food)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.025 -50.816 -6.324 67.879 212.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.416 43.410 1.922 0.0622 .
## income 10.210 2.093 4.877 1.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89.52 on 38 degrees of freedom
## Multiple R-squared: 0.385, Adjusted R-squared: 0.3688
## F-statistic: 23.79 on 1 and 38 DF, p-value: 1.946e-05
names(ols.mod)
ols.mod$coefficients[1]
## (Intercept)
## 83.416
ols.mod$coefficients[2]
## income
## 10.20964
# If you just want the values: ols.mod$coefficients[[1]], ols.mod$coefficients[[2]]
• Under our assumptions of the linear regression model, the estimators b1 and b2 have the smallest
variance of all linear and unbiased estimators of β1 and β2. They are the Best Linear Unbiased Estimators
(BLUE) of β1 and β2.
• The estimators b1 and b2 are best within their class because they have the minimum variance. When
comparing two linear and unbiased estimators, we always want to use the one with the smaller
variance, because that estimation rule gives us the higher probability of obtaining an estimate that is
close to the true parameter value.
• For this to be the case, all of our assumptions should be true, except for the normality. For example, we
need homoskedasticity.
• A (graduate level version of the) central limit theorem says that the above is approximately correct
even if the errors are not normally distributed, as long as N is sufficiently large.
• We can estimate σ² by

σ̂² = ∑_{i=1}^{N} êi² / (N − 2)

where

êi = yi − ŷi = yi − b1 − b2 xi

• It can be shown that σ̂² is unbiased for σ².
• Note that the denominator is N − 2, not N.
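As a quick check (a sketch), the by-hand estimate matches what summary() reports as the residual standard error (about 89.52 for the food data):

library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp ~ income, data = food)
N <- nrow(food)
sigma2.hat <- sum(residuals(ols.mod)^2) / (N - 2)   # sum of squared residuals over N - 2
c(by.hand = sqrt(sigma2.hat), from.summary = summary(ols.mod)$sigma)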
• We can then construct the estimated variances and covariance by replacing σ² with σ̂²:

V̂ar(b2) = σ̂² / ∑(xi − x̄)²
V̂ar(b1) = σ̂² ∑xi² / (N ∑(xi − x̄)²)
ĉov(b1, b2) = −σ̂² x̄ / (N ∑(xi − x̄)²)

• We often write

se(b2) = √V̂ar(b2)
se(b1) = √V̂ar(b1)
R Example
library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp ~ income, data = food)
vcov(ols.mod)   # estimated variance-covariance matrix of (b1, b2)
## (Intercept) income
## (Intercept) 1884.44226 -85.903157
## income -85.90316 4.381752
# The standard errors are the square roots of the diagonal: sqrt(diag(vcov(ols.mod)))
In general, for any regression model estimated, it is a good idea to also include bootstrap estimates as a way to
test the robustness of the estimates. The core principle is simple random sampling with replacement
from the original data set; for each bootstrap sample we estimate the same regression model and then collect the
usual statistics.
R Example
library(boot)
library(car)
set.seed(3435)
library(POE5Rdata)
data("food")
trans.mod = lm(food_exp~income, data=food)
betahat.boot = Boot(trans.mod, R=999)
usualEsts = summary(trans.mod)$coef[, 1:2]
summary(betahat.boot)
##
## Number of bootstrap replications R = 999
## original bootBias bootSE bootMed
## (Intercept) 83.416 -0.487805 31.391 81.858
## income 10.210 0.041439 1.974 10.241
confint(betahat.boot)
(Figure: bootstrap sampling distributions of the intercept and income coefficients.)
# Bootstrap for the estimated residual standard deviation:
sigmahat.boot <- Boot(trans.mod, R=199, f=sigmaHat, labels="sigmaHat")
summary(sigmahat.boot)
## 2.5 % 97.5 %
## sigmaHat 77.43262 109.2678
R Example
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(summary(fit)$r.square)
}
library(POE5Rdata)
data("food")
results <- boot(data=food, statistic=rsq, R=1000, formula=food_exp~income)
ci1=boot.ci(results, type="bca", index=1,conf=0.95)
ci_rsq = ci1$bca[ , c(4, 5)]
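To visualize the bootstrap distribution of R-squared together with the BCa limits computed above, one option is (a sketch):

hist(results$t, breaks = "FD", col = "skyblue3",
     main = "Bootstrap Distribution of R-Squared", xlab = "R-Squared")
abline(v = ci_rsq, col = "red", lwd = 2, lty = 2)   # BCa 95% confidence limits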
(Figure: bootstrap distribution of R-squared.)
4 Estimating Nonlinear Relationships
When the data are not suitable for a linear model, we can consider nonlinear transformations of y and/or x.
In econometrics, four very popular nonlinear models are the (1) log-log, (2) log-linear, (3) linear-log, and (4)
quadratic models. The table below summarizes these models together with their slopes and elasticities.
Recall that the elasticity, ε, can be computed as ε = slope × x/y.

Model        Equation                  Slope (dy/dx)   Elasticity
Quadratic    y = β1 + β2 x²            2 β2 x          2 β2 x (x/y)
Log-log      ln(y) = β1 + β2 ln(x)     β2 (y/x)        β2
Log-linear   ln(y) = β1 + β2 x         β2 y            β2 x
Linear-log   y = β1 + β2 ln(x)         β2 / x          β2 / y

For example, a quadratic model for house prices such as
PRICE = α1 + α2 SQFT² + e

has the estimated marginal effect (slope)

dP̂RICE/dSQFT = 2 α̂2 SQFT
R Example
library(POE5Rdata)
# Home Prices (y=price) vs. Living Area (x=sqft)
data(br)
reg.mod <- lm(price~I(sqft^2), data=br)
S(reg.mod)
## Call: lm(formula = price ~ I(sqft^2), data = br)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.578e+04 2.890e+03 19.30 <2e-16 ***
## I(sqft^2) 1.542e-02 3.131e-04 49.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 68210 on 1078 degrees of freedom
## Multiple R-squared: 0.6923
## F-statistic: 2426 on 1 and 1078 DF, p-value: < 2.2e-16
## AIC BIC
## 27110.35 27125.30
b1 <- coef(reg.mod)[[1]]
b2 <- coef(reg.mod)[[2]]
plot(br$sqft, br$price, xlab="Area (ft^2)", ylab="Home Price ($)",pch=20)
curve(b1+b2*x^2, col="red", add=TRUE,lwd=2)
(Figure: home price vs. area with the fitted quadratic curve.)
# Q1: Is the quadratic fit a good one?
# A1: From the scatterplot we can see that for
# homes with sqft>4000, the spread increases and the number
# of observations decreases.
# This suggests that our fit will worsen as we consider
# larger values of x beyond ~ 4000 ft^2.
#Q2: Are the residuals normally distributed?
hist(reg.mod$residuals,breaks ="FD",col="skyblue2", freq = FALSE, ylab = "Density",
main = "Histogram of the Residuals")
lines(density(reg.mod$residuals),lwd = 2, col ="red")
(Figure: histogram and kernel density estimate of the residuals from the quadratic model.)
library(tseries)
jarque.bera.test(reg.mod$residuals)
##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 61385, df = 2, p-value < 2.2e-16
# We can also compute traditional metrics such as predicted
# values and elasticities
sqftx=c(2000, 4000, 6000) #given values for sqft
pricex=b1+b2*sqftx^2 #prices corresponding to given sqft
DpriceDsqft <- 2*b2*sqftx # marginal effect of sqft on price
elasticity=DpriceDsqft*sqftx/pricex
cat("DpriceDsqft (x =2000, 4000, 6000) =", DpriceDsqft)
R Example
# Weekly Food Expenditure (y=food_exp in $) vs. Income (x=income in units of $100)
# We will consider a Linear-Log Model: y = beta_0 + beta_1 log(x)
#library(PoEdata)
data(food)
head(food)
## food_exp income
## 1 115.22 3.69
## 2 135.98 4.39
## 3 119.34 4.75
## 4 114.96 6.03
## 5 187.05 12.47
## 6 243.92 12.98
plot(food$income, food$food_exp, xlab="Income ($100)", ylab="Food Expenditure ($)",pch=20)
reg.mod <- lm(food$food_exp~log(food$income))
S(reg.mod)
(Figure: food expenditure vs. income scatterplot for the linear-log model.)
#Interpretation 1: For a household with $1000 weekly income,
# we expect an additional weekly food spending of $13.22
# for an additional $100 increase in weekly income.
#Interpretation 2: For a 1% increase in income,
# we expect food expenditure to increase by $1.32/week.
plot(food$income,reg.mod$residuals,pch=20, ylab="Residuals", xlab="Income")
abline(h=0, lwd=2, col="red")
(Figure: residuals from the linear-log model vs. income, with a horizontal reference line at zero.)
#Q: Are the residuals normally distributed?
hist(reg.mod$residuals,breaks ="FD",col="skyblue2", freq = FALSE, ylab = "Density",
main = "Histogram of the Residuals",xlab="Residuals")
lines(density(reg.mod$residuals),lwd = 2, col ="red")
(Figure: histogram and kernel density estimate of the residuals from the linear-log model.)
library(tseries)
jarque.bera.test(reg.mod$residuals)
##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 0.19989, df = 2, p-value = 0.9049
4.1 Logarithms
In econometrics, logarithms are very useful when dealing with variables that span different orders
of magnitude yet must be included in the same regression model. For example, consider a model that includes
the predictors GDP (on the order of 10^10) and real interest rates (on the order of 10^-2). Logarithms can also help us mitigate violations of the
constant variance assumption often used in least squares. By default, R uses natural logs when you use the
log function.
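A one-line check of the defaults (a trivial sketch):

log(100)              # natural log (base e)
log10(100)            # base-10 log
log(100, base = 10)   # equivalent to log10()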
# Here we look at the Ornstein data which includes financial assets of 248 Canadian companies.
par(mfrow=c(1, 2), mar=c(5, 4, 6, 2) + 0.1)
densityPlot(~ assets, data=Ornstein, xlab="assets", main="Untransformed")
densityPlot(~ log10(assets), data=Ornstein, adjust=0.65,
xlab=expression(log[10]~"(assets)"), main="Log Plot")
basicPowerAxis(0, base=10, side="above", at=10^(2:5),
axis.title="")
(Figure: density plots of assets, untransformed and on the log10 scale.)
(Figure: infant mortality rate (per 1000 births) vs. per-person GDP for the UN data, untransformed and on a log-log scale; Angola, Equatorial Guinea, and Gabon are flagged.)
## Angola Equatorial Guinea Gabon
## 4 54 62
Power transformations of the predictors and/or response variable(s) can help improve their respective
distributions for modeling/estimation purposes (e.g., Least Squares estimates of regression coefficients) and
sampling/robustness (e.g., Bootstrapping and Bayesian inference). A popular transformation is the Box-Cox
scale power transformation given by:
T_BC(x, λ) = x^(λ) = (x^λ − 1)/λ   when λ ≠ 0;
T_BC(x, λ) = loge(x)               when λ = 0.
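The car function bcPower implements this transformation; as a quick sketch (the value λ = 0.25 below is an arbitrary illustration), we can compare it against the formula applied by hand:

library(car)
x <- Prestige$income
lambda <- 0.25
head(cbind(by.hand = (x^lambda - 1) / lambda, bcPower = bcPower(x, lambda)))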
The R function symbox automatically produces boxplots of the variable to transform for different values of λ
to quickly gauge which power-law is more appropriate. For example, we can use this function on the Per
Person GDP (ppgdp) as follows:
library(car)
symbox(~ppgdp,data=UN)
(Figure: symbox boxplots of ppgdp for a range of powers.)
# Or we can look at income from the Prestige dataset
symbox(~ income, data=Prestige)
(Figure: symbox boxplots of income for a range of powers.)
Oftentimes our variables take on negative values (e.g., S&P 500 returns), for which transformations such as logs
do not work. Instead we can use the Yeo-Johnson transformation, TBC(x + 1, λ) for nonnegative values
of x and TBC(−x + 1, 2 − λ) for negative values of x.
# The function yjPower takes as inputs your variable (x) and lambda,
# and outputs the respective transformed variable.
x=-5:5
yjPower(x, 3)
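The next figure is a plain histogram of the UN infant mortality rates; a sketch of a chunk that would produce it (assuming the UN data from car/carData):

library(car)   # the UN data set
hist(UN$infantMortality, col = "skyblue3")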
(Figure: histogram of UN$infantMortality; the distribution is right-skewed.)
# The histogram shows that the distribution is right-skewed. Therefore, we can try
# (1) test if a transformation is needed and (2) if it is needed, transform it.
p1 <- powerTransform(infantMortality ~ 1, data=UN, family="bcPower")
summary(p1)
## LRT df pval
## LR test, lambda = (0.5) 41.95826 1 9.3243e-11
In the case of Multiple Regression, we may need to transform more than one variable at the same time
(each with its own power-law). For example, we can look at the Multivariate Box Cox Highway1 data. The
following example is from the bcPower documentation:
# Multivariate Box Cox uses Highway1 data
# See: https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/Highway1
summary(powerTransform(cbind(len, adt, trks, sigs1) ~ 1, Highway1))
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## len 0.1439 0 -0.2728 0.5606
## adt 0.0876 0 -0.1712 0.3464
## trks -0.6954 0 -1.9046 0.5139
## sigs1 -0.2654 0 -0.5575 0.0267
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0 0 0) 6.014218 4 0.19809
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1 1 1) 127.7221 4 < 2.22e-16
# Multivariate transformation to normality within levels of 'htype'
summary(a3 <- powerTransform(cbind(len, adt, trks, sigs1) ~ htype, Highway1))
## LRT df pval
## LR test, lambda = (0 0 0 -1) 31.12644 4 2.8849e-06
# save the rounded transformed values, plot them with a separate
# color for each highway type
transformedY <- bcPower(with(Highway1, cbind(len, adt, trks, sigs1)),
coef(a3, round=TRUE))
scatterplotMatrix( ~ transformedY|htype, Highway1)
(Figure: scatterplot matrix of the transformed variables (len.0, adt.0.33, trks.0, sigs1..0.5) by highway type.)
In regression models such as probit and logistic, where the response variable is a probability, the range is
restricted to the interval [0, 1]. This can present problems for such models because values can cluster
near the end points. Therefore, transformations that spread out the values, such as the arcsine square root, are
good choices for these cases. This transformation is known for its variance-stabilizing properties and is given
by:

T_asinsqrt(x) = sin⁻¹(√x)
par(mfrow=c(2,2))
hist(seq(0,1,length=100),main="Original", xlab="x")
hist(asin(sqrt(seq(0,1,length=100))),main="Transformed", xlab="x")
plot(seq(0,1,length=100),main="Original",type='l')
plot(asin(sqrt(seq(0,1,length=100))),main="Transformed", type='l')
(Figure: histograms and line plots of an equally spaced grid on [0, 1], original and after the arcsine square root transformation.)
Consider again the Ornstein dataset showing the number of interlocks by country.
Boxplot(interlocks ~ nation, data=Ornstein)
(Figure: boxplots of interlocks by nation, with outlying firms labeled.)
## [1] "1" "2" "3" "5" "6" "9" "13" "27"
From the figure we can see that it would be helpful to even out the observed spread across countries. A
measure of the spread level can easily be obtained with Tukey’s spread-level plot, which also provides a
suggested power transformation to stabilize the variance:
spreadLevelPlot(interlocks+1 ~ nation,Ornstein)
(Figure: Tukey spread-level plot of interlocks + 1 by nation (hinge-spread vs. median).)
## LowerHinge Median UpperHinge Hinge-Spread
## US 2 6.0 13 11
## UK 4 9.0 14 10
## CAN 6 13.0 30 24
## OTH 4 15.5 24 20
##
## Suggested power transformation: 0.1534487
According to the output, λ = 0.15 would be an appropriate choice. Since this value is close to 0,
which corresponds to a log transformation, we can try a log to see if it works:
Boxplot(log10(interlocks+1) ~ nation, data=Ornstein)
(Figure: boxplots of log10(interlocks + 1) by nation.)
We can consider transforming both x and y in an effort to linearize the relationship between them. For
example, we can find values of the positive exponents p and q such that the linear regression model
Yi^q = β0 + β1 Xi^p + ε exhibits a more linear relationship. The choice of these exponents can be guided by
Mosteller & Tukey’s Bulging Rule for linearization, as shown in the figure below.
fakedataMT=function(p=1,q=1,n=99,s=.1){
set.seed(1)
X=seq(1/(n+1),1-1/(n+1),length=n)
Y=(5+2*X^p+rnorm(n,sd=s))^(1/q)
return(data.frame(x=X,y=Y))}
par(mfrow=c(2,2))
plot(fakedataMT(p=.5,q=2),main="(p=1/2,q=2)")
plot(fakedataMT(p=3,q=-5),main="(p=3,q=-5)")
plot(fakedataMT(p=.5,q=-1),main="(p=1/2,q=-1)")
plot(fakedataMT(p=3,q=5),main="(p=3,q=5)")
(Figure: simulated (x, y) data for the four (p, q) combinations, illustrating the bulging rule.)
In summary, when choosing a functional form:
• We should do our best to choose a functional form that:
– is consistent with economic theory,
– fits the data well, and
– is such that the assumptions of the regression model are satisfied.
• In real-world problems it is sometimes difficult to achieve all these goals.
• In applications of econometrics we must simply do the best we can to choose a satisfactory functional
form.
PRICE = β1 + β2 UTOWN + e

• When an indicator variable is used in a regression, it is important to write the regression function for
the different values of the indicator variable:

E[PRICE] = β1 + β2, if UTOWN = 1
E[PRICE] = β1,      if UTOWN = 0
• In the simple regression model, an indicator variable on the right-hand side gives us a way to estimate
the differences between population means.
• It is straightforward to show that if x is an indicator variable, we have b1 = ȳ0 and b2 = ȳ1 − ȳ0,
where ȳ0 and ȳ1 are the sample means of y in the x = 0 and x = 1 groups, as the quick check below illustrates.
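A quick numerical check with the wage data used in the example below (a minimal sketch):

library(POE5Rdata)
data("cps5_small")
means <- tapply(cps5_small$wage, cps5_small$female, mean)   # group means of wage by female (0/1)
c(b1 = means[["0"]], b2 = means[["1"]] - means[["0"]])
coef(lm(wage ~ female, data = cps5_small))                   # OLS gives the same numbers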
If you learned about the test for the difference of two means in an elementary statistics
course, you will recognize that the regression here includes that earlier topic as a special case.
Next, we will go over an example in R. In this case we have binary indicator variable for gender, I, such that
I = 1 (gender = female) and I = 0 (gender = male). Our model is a regression of hourly wages (WAGE) on
gender (FEMALE = 1(female)). In this case our model can be expressed as:
E(WAGE) = β0 + β1 FEMALE =
β0 + β1, if FEMALE = 1
β0,      if FEMALE = 0.
The respective model parameters are given by β0 = E[WAGE | Male] and β1 = E[WAGE | Female] − E[WAGE | Male].
# Example 8: Wages (y=wage) vs. Gender (x=female)
library(POE5Rdata)
data("cps5_small")
attach(cps5_small)
# Note: When dealing with indicator variables it is important to check the number of
# observations in each category to see if the data are balanced or not.
S(cps5_small[,1:4])
wage_female = wage[which(female==1)]
wage_male = wage[which(female==0)]
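The density histograms shown next compare the two groups; a sketch of such a chunk is:

par(mfrow = c(2, 1))
hist(wage_female, freq = FALSE, col = "skyblue3",
     main = "Density Histogram of the Wages (Female)", xlab = "Wages")
hist(wage_male, freq = FALSE, col = "skyblue3",
     main = "Density Histogram of the Wages (Male)", xlab = "Wages")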
(Figure: density histograms of wages for females and for males.)
# Now we can estimate the model and interpret the results
reg.mod = lm(wage~female)
S(reg.mod)