Chapter 2

The Simple Linear Regression Model

Dr. Randall R. Rojas

Contents
1 Univariate Characterizations
  1.1 Statistical Summary
  1.2 Histograms
  1.3 Density Estimation
  1.4 Quantile Plots
  1.5 Boxplots
2 Bivariate Characterizations
  2.1 Scatterplots
  2.2 Jittering Scatterplots
3 The Method of Least Squares for the Simple Linear Regression
  3.1 An Econometric Model
    3.1.1 Data Generating Process
    3.1.2 The Random Error and Strict Exogeneity
    3.1.3 The Regression Function
    3.1.4 Random Error Variation
    3.1.5 Variation in x
    3.1.6 Error Normality
    3.1.7 Error Correlation
  3.2 Estimating the Regression Parameters
  3.3 The Gauss-Markov Theorem
  3.4 Distribution of the Least Squares Estimators
  3.5 Random Sampling
    3.5.1 Bootstrapping Regression Models
4 Estimating Nonlinear Relationships
  4.1 Logarithms
  4.2 Power Transformations
  4.3 Transforming Restricted-Range Variables
  4.4 Transformations to Equalize Spread
  4.5 Transformations Toward Linearity
5 Regressions with Indicator Variables

1 Univariate Characterizations
In a regression setting we start by looking at a univariate characterization of the data. This entails
looking at the respective statistical summaries, features, and distributions of all the variables we have.
Among the popular univariate graphical tools, we will consider histograms, density plots, QQ-plots, and
boxplots.

1.1 Statistical Summary

Looking at the summary statistics of our data allows us to identify any potential issues we may encounter when
building our regression model. For example, there may be NAs, values outside the expected range, etc. Below
are two examples using the pastecs and psych libraries.
library(pastecs)
# Summary for 1 variable:
stat.desc(rnorm(100),norm=TRUE)

## nbr.val nbr.null nbr.na min max range


## 100.0000000 0.0000000 0.0000000 -2.9105814 2.8800428 5.7906242
## sum median mean SE.mean CI.mean.0.95 var
## -20.1869404 -0.2160320 -0.2018694 0.1042231 0.2068013 1.0862461
## std.dev coef.var skewness skew.2SE kurtosis kurt.2SE
## 1.0422313 -5.1628989 0.1544150 0.3198589 0.1519546 0.1588383
## normtest.W normtest.p
## 0.9950863 0.9773210
# Summary for an entire datacube:
library(car)
attach(Prestige)
#stat.desc(Prestige, norm=TRUE)

library(psych)
describe(Prestige)

## vars n mean sd median trimmed mad min max


## education 1 102 10.74 2.73 10.54 10.63 3.15 6.38 15.97
## income 2 102 6797.90 4245.92 5930.50 6161.49 3060.83 611.00 25879.00
## women 3 102 28.98 31.72 13.60 24.74 18.73 0.00 97.51
## prestige 4 102 46.83 17.20 43.60 46.20 19.20 14.80 87.20
## census 5 102 5401.77 2644.99 5135.00 5393.87 4097.91 1113.00 9517.00
## type* 6 98 1.79 0.80 2.00 1.74 1.48 1.00 3.00
## range skew kurtosis se
## education 9.59 0.32 -1.03 0.27
## income 25268.00 2.13 6.29 420.41
## women 97.51 0.90 -0.68 3.14
## prestige 72.40 0.33 -0.79 1.70
## census 8404.00 0.11 -1.49 261.89
## type* 2.00 0.40 -1.36 0.08
# Note: The flag norm allows us to also include normal
# distribution statistics such as skewness, kurtosis, etc.

1.2 Histograms

Histograms should be one of our first analysis tools since they can help us learn about each variable's properties.
Things to look for:
• Outliers: These can influence the regression fit in many ways.
• Shape: Is the distribution symmetric, skewed, etc.?
• Range: Is the range of values very large, for example spanning several orders of magnitude?
• Density of observations and/or clustering features: Are the data all clustered in certain regions? In high-density
regions your regression fit may be robust but then become unstable in less dense ones.
A near-optimal number of bins k based on the number of observations n is given by Sturges' rule:

k = 1 + log2(n)

However, more sample-specific formulas exist, such as the Freedman-Diaconis (FD) rule, which suggests

k = n^(1/3) (max − min) / (2 (Q3 − Q1))

For example, we can compare the two methods for 2 different data sets:
#Compare histograms using k vs FD (same sample used for each pair of plots):
#quartz()
par(mfrow=c(2,2))
n = 1000; k = 1 + log2(n)
x.norm = rnorm(n)
hist(x.norm, breaks = k, col="skyblue4", main = "Normal Distr. Using k")
hist(x.norm, breaks = "FD", col="skyblue4", main = "Normal Distr. Using FD")
x.exp = rexp(n, 0.5)
hist(x.exp, breaks = k, col="skyblue2", main = "Exp Distr. Using k")
hist(x.exp, breaks = "FD", col="skyblue2", main = "Exp Distr. Using FD")

[Figure: histograms of the Normal and Exponential samples, binned using k (left column) and the FD rule (right column).]

1.3 Density Estimation

Histograms are very informative, but visualizing the density plot together with the histogram and a 1D
scatterplot (rug plot) can be even more informative. A common (nonparametric) kernel density estimate is given by:

p̂(x) = (1/(nh)) Σ_{i=1}^{n} K( (x − x_i) / h )

where K is a kernel function (e.g., the Normal density), h is the bandwidth (amount of smoothness), and n is the
number of observations.
#quartz()
n = 1000
x.exp = rexp(n, 0.5)
hist(x.exp, breaks = "FD", col="skyblue2", freq = FALSE, ylab = "Density")
lines(density(x.exp), lwd = 2, col = "red")
lines(density(x.exp, adjust = 0.5), lwd = 2, col = "blue", type = 'l', lty = 2)
rug(x.exp)

[Figure: histogram of the exponential sample with kernel density estimates (default and adjust = 0.5 bandwidths) and a rug plot.]
#You can also show the density with color
library(ggplot2)
data=data.frame(value=rnorm(10000))
ggplot(data, aes(x=value))+geom_histogram(binwidth = 0.2, aes(fill = ..count..) )

[Figure: ggplot2 histogram of the simulated values with bin counts mapped to the fill color.]

1.4 Quantile Plots

These plots are very useful for comparing the distribution of a variable to a theoretical reference distribution.
By default the Normal distribution is used in the function ‘qqPlot’ (Quantile - Quantile) but we can provide
any other distribution.
#The dataset Prestige from the car package has 102 observations
library(car)
attach(Prestige)
qqPlot(~ income, data=Prestige, id=list(n=3))

[Figure: normal QQ plot of income; the three most extreme observations (general.managers, physicians, lawyers) are labeled.]
## general.managers physicians lawyers
## 2 24 17
qqPlot(income ~ type, data=Prestige, layout=c(1, 3))

[Figure: normal QQ plots of income by occupation type (bc, prof, wc), with extreme observations labeled.]

# Type = Professional (prof), White Collar (wc), Blue Collar (bc)

qqPlot(lm(prestige ~ income + education + type, data=Duncan),
envelope=.99)
[Figure: QQ plot of the studentized residuals from lm(prestige ~ income + education + type, data = Duncan) against t quantiles with a 99% envelope; minister and machinist are flagged.]

## minister machinist
## 6 28

1.5 Boxplots

Boxplots allow us to know more quantitative aspects of a variable's distribution, such as the min, max, Q1,
Q3, IQR, and median. Another helpful property is that they are very convenient when comparing many
distributions simultaneously and/or conditioning on other variables.
#Boxplot of Income
Boxplot(~income, data=Prestige)

[Figure: boxplot of income; the outliers general.managers, lawyers, physicians, veterinarians, and osteopaths.chiropractors are labeled.]
## [1] "general.managers" "lawyers"


## [3] "physicians" "veterinarians"
## [5] "osteopaths.chiropractors"
# Note: For more choices of boxplots, look at: https://www.r-graph-gallery.com/boxplot/

# These are known as Parallel Boxplots


Boxplot(Ornstein$interlocks~Ornstein$nation)

[Figure: parallel boxplots of Ornstein$interlocks by Ornstein$nation (CAN, OTH, UK, US), with several outlying firms labeled.]
## [1] "1" "2" "3" "5" "6" "9" "13" "27"
# Note: interlocks = Number of interlocking director and executive positions
# shared with other major firms.

library("plotrix")
means <- Tapply(interlocks ~ nation, mean, data=Ornstein)
sds <- Tapply(interlocks ~ nation, sd, data=Ornstein)

plotCI(1:4, means, sds, xaxt="n", xlab="Nation of Control",
ylab="interlocks", ylim=range(Ornstein$interlocks))
lines(1:4, means)
axis(1, at=1:4, labels = names(means))
[Figure: mean interlocks by nation of control with ±1 SD error bars (plotCI), connected by lines.]

2 Bivariate Characterizations
Once we have the univariate description of the variables, we can then examine the pairwise associations via,
e.g., scatterplots.

2.1 Scatterplots

library(car)
attach(Prestige)
scatterplot(prestige ~income, lwd=3, id=TRUE)

Note (annotation on the figure): when looking at these graphs, look for the line and the shape of the point cloud; here the relationship shows a positive trend. Too much discrepancy between the line and the shape of the plot suggests that the model is not a good fit.
[Figure: scatterplot of prestige vs. income with marginal boxplots, a linear fit, and a lowess smoother; observations 2 and 24 are labeled.]

## [1] 2 24
#Note: Lowess smoother is a 'locally weighted smoother'

#We may want to condition on other variables, for example, on type:


scatterplot(prestige ~income|type, lwd=3,col=c("blue", "green", "red"),id=TRUE)

[Figure: scatterplot of prestige vs. income conditioned on type (bc, prof, wc), with separate fits; several observations are labeled.]
## 31 58 51 82
## 1 2 3 19 24 25
#If we have several predictors in our model, it is more practical to
# simply look at as scatterplot matrix of the data. For example:
# scatterplotMatrix(~ prestige + income + education + women)

#We can also look at all the variables conditioned on type:


scatterplotMatrix(~ prestige + income + education + women | type,smooth=FALSE, ellipse=list(levels=0.5))

#We can also look at 3D visualizations


scatter3d(prestige ~ income + education)
[Figure: scatterplot matrix of prestige, income, education, and women conditioned on type, with 50% concentration ellipses (output of the scatterplotMatrix call above).]

2.2 Jittering Scatterplots

Jittering the data (adding noise) helps separate overplotted observations


plot(Vocab$vocabulary~Vocab$education)

[Figure: scatterplot of Vocab$vocabulary vs. Vocab$education showing heavy overplotting.]
#Add some random noise to the variables --> jitter the data
plot(jitter(Vocab$vocabulary)~jitter(Vocab$education))
[Figure: the same scatterplot after jittering both variables.]
# Increase the amount of jitter
plot(jitter(Vocab$vocabulary,factor =2)~jitter(Vocab$education, factor =2),col="gray",cex=0.5)
abline(lm(Vocab$vocabulary~ Vocab$education),lwd= 3, lty="dashed")
lines(lowess(Vocab$education,Vocab$vocabulary,f=0.2), lwd =3, col="red")

[Figure: scatterplot with increased jitter (factor = 2), the least squares line (dashed), and a lowess smoother (red).]

3 The Method of Least Squares for the Simple Linear Regression


Economic theory suggests many relationships between economic variables. A regression model is helpful for
questions such as the following: if one variable changes in a certain way, by how much will another variable
change? For example, by how much do you expect hourly wages to increase with an additional year of
education?
• The regression model is based on assumptions about the conditional pdf of y (e.g., hourly wages) given
x (e.g., years of education).
– The continuous random variable y has a probability density function (pdf).
– The pdf is a conditional probability density function because it is “conditional” upon a given value of x.
• For example, households with an income of $1000 per week would have various food expenditure per
person for a variety of reasons.
• The pdf f (y) describes how expenditures are distributed over the population.
• This is a conditional pdf because it is “conditional” upon household income.
• The conditional mean, or expected value, of y is E [y| x = $1000] = µy| x and is our population’s mean
weekly food expenditure per person.
• The conditional variance of y is Var [y| x = $1000] = σ2 which measures the dispersion of household
expenditures y about their mean.
• The parameters µy| x and σ2 , if they were known, would give us some valuable information about the
population we are considering.
• In order to investigate the relationship between expenditure and income we must build an economic
model and then a corresponding econometric model that forms the basis for a quantitative or empirical
economic analysis.
• This econometric model is also called a regression model.

Table 1: Terminology:
y x
Dependent Variable Independent Variable
Explained Variable Explanatory Variable
Response Variable Control Variable
Predicted Variable Predictor Variable
Regressand Regressor

3.1 An Econometric Model

The econometric model comes from economic theory and provides the motivation for the analysis; e.g., it could be
a supply equation. The parameters of the model are also known as the population parameters. For example, let's
examine the relationship between weekly food expenditure and weekly household income.
• A household spends $80 plus 10 cents of each dollar of income received on food.
• Algebraically the rule is y = 80 + 0.10x where y = weekly household food expenditure ($) and x =
weekly household income ($).
• In reality, many factors may affect household expenditure on food.
• Let e (error term) = everything else affecting food other than income.
• y = β 1 + β 2 x + e. This equation is the simple regression model.
• A simple linear regression analysis examines the relationship between a y-variable and one x-variable.

R Example
library(POE5Rdata)
# Histograms
data("food")
hist(food$income,col="skyblue3", main = "PDF of Weekly Income = f(x)")

[Figure: histogram of weekly income, f(x).]
hist(food$food_exp, col="skyblue3", main = "PDF of Weekly Food Expenditure = f(y|x)")

[Figure: histogram of weekly food expenditure, f(y|x).]
hist(food$food_exp[food$income<=20],col="skyblue3", main = "PDF of f(y|x<2000)")

[Figure: histogram of food expenditure for households with weekly income of at most $2000, f(y|x ≤ 2000).]


hist(food$food_exp[food$income>20],col="skyblue3", main = "PDF of f(y|x>2000)")

[Figure: histogram of food expenditure for households with weekly income above $2000, f(y|x > 2000).]


#We can also look at the estimate of the errors, i.e., the residuals
ols.mod <- lm(food_exp~income, data=food)
plot(food$income, ols.mod$residuals, pch=20, ylab="Residuals", xlab="Weekly Income ($100s)")

abline(h=0,col="red", lwd=2)

[Figure: residuals vs. weekly income with a horizontal reference line at zero; the spread of the residuals increases with income.]
# Note: From the residuals plot we can see that the variance is not constant

3.1.1 Data Generating Process

The Simple Linear Regression (SLR) model is given by:

y_i = β1 + β2 x_i + e_i

In R (and most other software packages) the default null hypothesis tested is H0 : βk = 0 against the
alternative H1 : βk ≠ 0, k = 1, 2. Therefore, if the respective estimated p-value is small (p-value < α), we
reject H0 and conclude that the respective parameter is statistically significantly different from 0.
The Simple Linear Regression (SLR) assumptions of the model are that:
1. Linearity: The value of y, for each value of x, is y = β1 + β2 x + e
2. E[e_i | x_i] = 0, which is the same as E[y | x] = β1 + β2 x
3. Homoskedasticity: Var[e_i | x_i] = Var[y | x] = σ² (constant variance condition, i.e., the errors are homoskedastic)
4. cov(y_i, y_j) = cov(e_i, e_j | x) = 0 (even better if the errors are independent, which is a stronger condition)
5. Rank Condition: x is not random and takes on at least two different values
6. Normality (Optional): The errors are normally distributed, i.e., e_i | x_i ∼ N(0, σ²)
Note: Since the data {(x1, y1), . . . , (xn, yn)} are a random sample, the pairs (x_i, y_i) are independent and
identically distributed (iid).
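As an illustration of these assumptions, we can simulate data directly from such a DGP and verify that lm() recovers the parameters. This is only a minimal sketch: the values β1 = 100, β2 = 10, and σ = 50 below are arbitrary choices for illustration.
# Simulate data from the DGP y = beta1 + beta2*x + e with homoskedastic normal errors
set.seed(123)
n <- 200
beta1 <- 100; beta2 <- 10; sigma <- 50   # illustrative population values
x <- runif(n, 0, 30)                     # x takes on many different values
e <- rnorm(n, mean = 0, sd = sigma)      # E[e|x] = 0 and Var[e|x] = sigma^2
y <- beta1 + beta2*x + e
coef(lm(y ~ x))                          # estimates should be close to (100, 10)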

3.1.2 The Random Error and Strict Exogeneity

• The second assumption of the simple regression model concerns the “everything else” term e.
• Unlike food expenditure and income, the random error term is not observable; it is unobservable.
• The x-variable, income, cannot be used to predict the value of e.
• E [ei | xi ] = 0 has two implications:
– E [ ei ] = 0
– cov (ei , xi ) = 0
To better understand the error term we can think of decomposing the observation y into 2 components:
1. E[y| x ] = β 1 + β 2 x, which represents the systematic component of y, and
2. e = y − E[y| x ] = y − β 1 − β 2 x, which represents the random component and thus, encapsulates
everything else that contributes to the variations in y. Therefore,

y        =        β1 + β2 x        +        e
(Data)            (Model)              (Uncertainty)

R Example
library(POE5Rdata)
data("food")
# Test the covariance assumption
ols.mod <- lm(food_exp~income, data=food)
cov(food$income, ols.mod$residuals) #Seems close enough to 0

## [1] -1.166019e-14
# Test E[e] = 0
mean(ols.mod$residuals) #Seems close enough to 0

## [1] -4.973799e-15

3.1.3 The Regression Function

The conditional expectation E [yi | xi ] = β 1 + β 2 xi is called the regression function.


This also says given a change in x, ∆x, the resulting change in E [yi | xi ] is β 2 ∆x holding all else constant.
We can say that a change in x leads to, or causes, a change in the expected (population average) value of y
given xi , E [yi | xi ]

• The fitted regression line is

ŷ_i = b1 + b2 x_i

and the residual is

ê_i = y_i − ŷ_i = y_i − b1 − b2 x_i

• The b's are chosen to minimize

Σ_i ê_i² = Σ_i (y_i − ŷ_i)² = Σ_i (y_i − b1 − b2 x_i)²

• The First Order Conditions (FOC) of the minimization are

0 = Σ_i (y_i − b1 − b2 x_i) = Σ_i ê_i
0 = Σ_i x_i (y_i − b1 − b2 x_i) = Σ_i x_i ê_i

• We will call the estimators given by

b2 = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²
b1 = ȳ − b2 x̄

the ordinary least squares estimators. “Ordinary least squares” is abbreviated as OLS. (A numerical check of these formulas on the food data appears after this list.)
• It can be shown that

b2 = Σ_{i=1}^{N} w_i y_i,  where  w_i = (x_i − x̄) / Σ (x_i − x̄)²

• It can also be written as

b2 = β2 + Σ_{i=1}^{N} w_i e_i

based on which we obtain E[b2] = β2, i.e., it is unbiased.
• It can further be shown that

Var(b2) = σ² / Σ (x_i − x̄)²
Var(b1) = σ² Σ x_i² / (N Σ (x_i − x̄)²)
cov(b1, b2) = −σ² x̄ / (N Σ (x_i − x̄)²)

• Note that
1. The larger the variance term σ², the greater the uncertainty there is in the statistical model, and the
larger the variances and covariance of the least squares estimators.
2. The larger the sum of squares Σ (x_i − x̄)², the smaller the variances of the least squares estimators
and the more precisely we can estimate the unknown parameters.
3. The larger the sample size N, the smaller the variances and covariance of the least squares
estimators.
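These formulas are easy to verify numerically. The following minimal sketch uses the food data introduced earlier: it computes b2 and b1 directly from the formulas, compares them with lm(), and checks that the residuals satisfy the first-order conditions.
library(POE5Rdata)
data("food")
x <- food$income; y <- food$food_exp
b2 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1 <- mean(y) - b2 * mean(x)
c(b1 = b1, b2 = b2)          # about 83.42 and 10.21, same as coef(lm(food_exp ~ income, data = food))
ehat <- y - b1 - b2 * x
c(sum(ehat), sum(x * ehat))  # first-order conditions: both are 0 up to rounding error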

3.1.4 Random Error Variation

Ideally, the conditional variance of the random error is constant:

Var[e_i | x_i] = σ²

This is the homoskedasticity assumption. Assuming the population relationship y_i = β1 + β2 x_i + e_i, the
conditional variance of the dependent variable is

Var[y_i | x_i] = Var[β1 + β2 x_i + e_i | x_i] = Var[e_i | x_i] = σ²

If this assumption is violated, so that Var[e_i | x_i] depends on x_i, then the random errors are said to be
heteroskedastic.
R Example
library(POE5Rdata)
data("food")
#A plot of the estimate of the errors, i.e., the residuals, shows that they are heteroskedastic.
ols.mod <- lm(food_exp~income, data=food)
plot(food$income, ols.mod$residuals, pch=20, ylab="Residuals", xlab="Weekly Income ($100s)")
abline(h=0,col="red", lwd=2)

[Figure: residuals vs. weekly income with a horizontal reference line at zero; the spread increases with income, indicating heteroskedasticity.]
# Note: From the residuals plot we can see that the variance is not constant

3.1.5 Variation in x

In a regression analysis, one of the objectives is to estimate β 2 = ∆E [yi | xi ] /∆x.


• If we are to hope that a sample of data can be used to estimate the effects of changes in x, then we must
observe some different values of the explanatory variable x in the sample.
• The minimum number of x-values in a sample of data that will allow us to proceed is two.
• β2 in this case represents the slope, which tells us by how much the dependent variable y changes
if the independent variable x is increased by 1 unit.

3.1.6 Error Normality

• It is not at all necessary for the random errors to be conditionally normal in order for regression analysis
to “work”.
• When samples are small, it is advantageous for statistical inferences that the random errors, and
dependent variable y, given each x-value, are normally distributed.
• Central Limit Theorem says roughly that collections of many random factors tend toward having a
normal distribution.

3.1.7 Error Correlation

• It is possible that there are correlations between the random error terms.
• With cross-sectional data, data collected at one point in time, there may be a lack of statistical indepen-
dence between random errors for individuals who are spatially connected.
• Within a larger sample of data, there may be clusters of observations with correlated errors because of
the spatial component.
• The starting point in regression analysis is to assume that there is no error correlation.

3.2 Estimating the Regression Parameters

• We can use the sample data values of yi and xi from the food data set, to estimate the unknown
regression parameters β 1 and β 2 .
• These parameters represent the unknown intercept and slope coefficients for the food expenditure–
income relationship.
• If we represent the 40 data points as (yi , xi ), i = 1, . . . , 40, and plot them, we obtain the scatter diagram
below.
• Our problem is to estimate the location of the mean expenditure line.
• We would expect this line to be somewhere in the middle of all the data points.

R Example
library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp~income, data=food)
plot(food$income,food$food_exp, pch=20,
xlab="Weekly Income ($100s)", ylab="Weekly Food Expenditure ($)")
abline(ols.mod,lwd=2,col="blue")
legend(6, 550, legend="Linear Relationship Model: Mean Expenditure Line", col=c("blue"), lty=1, cex=0.8)
[Figure: scatterplot of weekly food expenditure vs. weekly income with the fitted least squares (mean expenditure) line.]

R Example
library(POE5Rdata)
# Estimate the model and examine the summary output
data("food")

ols.mod <- lm(food_exp~income, data=food)


summary(ols.mod)

##
## Call:
## lm(formula = food_exp ~ income, data = food)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.025 -50.816 -6.324 67.879 212.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.416 43.410 1.922 0.0622 .
## income 10.210 2.093 4.877 1.95e-05 ***
## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89.52 on 38 degrees of freedom
## Multiple R-squared: 0.385, Adjusted R-squared: 0.3688
## F-statistic: 23.79 on 1 and 38 DF, p-value: 1.946e-05
names(ols.mod)

## [1] "coefficients" "residuals" "effects" "rank"


## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
ols.mod$coefficients[1]

## (Intercept)
## 83.416
ols.mod$coefficients[2]

## income
## 10.20964
# If you just want the values: ols.mod$coefficients[[1]], ols.mod$coefficients[[2]]

3.3 The Gauss-Markov Theorem

• Under our assumptions of the linear regression model, the estimators b1 and b2 have the smallest
variance of all linear and unbiased estimators of β1 and β2. They are the Best Linear Unbiased Estimators
(BLUE) of β1 and β2.
• The estimators b1 and b2 are best within their class because they have the minimum variance. When
comparing two linear and unbiased estimators, we always want to use the one with the smaller
variance, because that estimation rule gives us the higher probability of obtaining an estimate that is
close to the true parameter value.
• For this to be the case, all of our assumptions except normality must hold; for example, we need
homoskedasticity.
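A minimal Monte Carlo sketch (with arbitrary illustrative values for β1, β2, and σ) makes the unbiasedness part of the theorem concrete: across many simulated samples from the same DGP, the average of the OLS slope estimates is close to the true β2, and their standard deviation is close to the formula for Var(b2) given earlier.
# Monte Carlo illustration of the unbiasedness of the OLS slope (illustrative values)
set.seed(42)
R <- 2000; n <- 50
beta1 <- 80; beta2 <- 10; sigma <- 40
x <- runif(n, 0, 30)                          # hold x fixed across replications
b2.sim <- replicate(R, {
  y <- beta1 + beta2*x + rnorm(n, sd = sigma)
  coef(lm(y ~ x))[2]
})
mean(b2.sim)                                  # approximately beta2 = 10
sd(b2.sim)                                    # close to sqrt(sigma^2 / sum((x - mean(x))^2))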

3.4 Distribution of the Least Squares Estimators

• If we make the normality assumption, we get

b2 | x ∼ N( β2 , σ² / Σ (x_i − x̄)² )
b1 | x ∼ N( β1 , σ² Σ x_i² / (N Σ (x_i − x̄)²) )

• A (graduate level version of the) central limit theorem says that the above is approximately correct
even if the errors are not normally distributed, as long as N is sufficiently large.
• We can estimate σ² by

σ̂² = Σ_{i=1}^{N} ê_i² / (N − 2)

where ê_i = y_i − ŷ_i = y_i − b1 − b2 x_i.
• It can be shown that σ̂² is unbiased for σ².
• Note that the denominator is N − 2, not N.
• We can then construct

V̂ar(b2) = σ̂² / Σ (x_i − x̄)²
V̂ar(b1) = σ̂² Σ x_i² / (N Σ (x_i − x̄)²)
ĉov(b1, b2) = −σ̂² x̄ / (N Σ (x_i − x̄)²)

• We often write

se(b2) = sqrt( V̂ar(b2) )
se(b1) = sqrt( V̂ar(b1) )

R Example
library(POE5Rdata)
data("food")

ols.mod <- lm(food_exp~income, data=food)


vcov(ols.mod)

## (Intercept) income
## (Intercept) 1884.44226 -85.903157
## income -85.90316 4.381752
# The standard errors are the square roots of the diagonal entries: sqrt(diag(vcov(ols.mod)))
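As a check on the formulas above, the following sketch recomputes σ̂² and the standard errors by hand for the same regression; the results match the Std. Error column of summary(ols.mod) and the diagonal of vcov(ols.mod).
x <- food$income
N <- nrow(food)
sig2.hat <- sum(ols.mod$residuals^2) / (N - 2)    # estimate of sigma^2
var.b2 <- sig2.hat / sum((x - mean(x))^2)
var.b1 <- sig2.hat * sum(x^2) / (N * sum((x - mean(x))^2))
c(se.b1 = sqrt(var.b1), se.b2 = sqrt(var.b2))     # about 43.41 and 2.09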

3.5 Random Sampling

3.5.1 Bootstrapping Regression Models

In general, for any regression model we estimate, it is a good idea to also include bootstrap estimates as a way to
test the robustness of the estimates. The core principle is simple random sampling with replacement
from the original dataset; for each sample we estimate the same regression model and then collect the
usual statistics.
R Example
library(boot)
library(car)
set.seed(3435)
library(POE5Rdata)
data("food")
trans.mod = lm(food_exp~income, data=food)

betahat.boot = Boot(trans.mod, R=999)
usualEsts = summary(trans.mod)$coef[, 1:2]
summary(betahat.boot)

##
## Number of bootstrap replications R = 999
## original bootBias bootSE bootMed
## (Intercept) 83.416 -0.487805 31.391 81.858
## income 10.210 0.041439 1.974 10.241
confint(betahat.boot)

## Bootstrap bca confidence intervals


##
## 2.5 % 97.5 %
## (Intercept) 21.42506 148.31179
## income 6.49895 14.45268
hist(betahat.boot)
[Figure: bootstrap distributions of the intercept and income coefficients, with normal and kernel density overlays, the BCa 95% CI, and the observed values marked.]
# Bootstrap for the estimated residual standard deviation:
sigmahat.boot <- Boot(trans.mod, R=199, f=sigmaHat, labels="sigmaHat")
summary(sigmahat.boot)

## R original bootBias bootSE bootMed


## sigmaHat 199 89.517 -3.2486 8.5707 87.046
confint(sigmahat.boot)

## Bootstrap bca confidence intervals


##

## 2.5 % 97.5 %
## sigmaHat 77.43262 109.2678
R Example
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(summary(fit)$r.square)
}
library(POE5Rdata)
data("food")
results <- boot(data=food, statistic=rsq, R=1000, formula=food_exp~income)
ci1=boot.ci(results, type="bca", index=1,conf=0.95)
ci_rsq = ci1$bca[ , c(4, 5)]

hist(results$t[,1], main = 'Coefficient of Determination: food_exp ~ income', xlab = 'R-Squared', col = 'skyblue2', freq = FALSE) # end of this line was cut off in the source; freq = FALSE and the fill color are reasonable completions


lines(density(results$t[,1]), col = 'blue',lwd=2)
abline(v = ci_rsq, col = 'red',lwd=3,lty=3)
abline(v=mean(results$t[,1]), col='red',lwd=3,lty=1)

[Figure: bootstrap distribution of R² for food_exp ~ income with a kernel density overlay, the BCa 95% CI limits (dotted), and the mean (solid).]

4 Estimating Nonlinear Relationships
When the data are not suitable for a linear model, we can consider nonlinear transformations of y and/or x.
In econometrics, four very popular nonlinear models are the (1) log-log model, ln(y) = β1 + β2 ln(x); (2) log-linear
model, ln(y) = β1 + β2 x; (3) linear-log model, y = β1 + β2 ln(x); and (4) quadratic model, y = α1 + α2 x².
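As a quick illustration of the syntax (not a claim that each specification fits these data well), all four functional forms can be estimated with lm() on the food data used earlier:
library(POE5Rdata)
data(food)
m.loglog <- lm(log(food_exp) ~ log(income), data = food)  # log-log
m.loglin <- lm(log(food_exp) ~ income,      data = food)  # log-linear
m.linlog <- lm(food_exp ~ log(income),      data = food)  # linear-log
m.quad   <- lm(food_exp ~ I(income^2),      data = food)  # quadratic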

Recall that the elasticity, ε, can be computed as ε = slope × (x/y). For example, a quadratic model for house
prices such as

PRICE = α1 + α2 SQFT² + e

has a slope given by

d(P̂RICE)/d(SQFT) = 2 α̂2 SQFT

and thus the respective elasticity is given by

ε = 2 α̂2 SQFT × ( SQFT / P̂RICE )

where P̂RICE denotes the fitted price.
As an example, let’s estimate the quadratic regression model above.

R Example
library(POE5Rdata)
# Home Prices (y=price) vs. Living Area (x=sqft)
data(br)
reg.mod <- lm(price~I(sqft^2), data=br)
S(reg.mod)

## Call: lm(formula = price ~ I(sqft^2), data = br)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.578e+04 2.890e+03 19.30 <2e-16 ***
## I(sqft^2) 1.542e-02 3.131e-04 49.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 68210 on 1078 degrees of freedom
## Multiple R-squared: 0.6923
## F-statistic: 2426 on 1 and 1078 DF, p-value: < 2.2e-16
## AIC BIC
## 27110.35 27125.30
b1 <- coef(reg.mod)[[1]]
b2 <- coef(reg.mod)[[2]]
plot(br$sqft, br$price, xlab="Area (ft^2)", ylab="Home Price ($)",pch=20)
curve(b1+b2*x^2, col="red", add=TRUE,lwd=2)
[Figure: scatterplot of home price vs. living area with the fitted quadratic curve.]
# Q1: Is the quadratic fit a good one?
# A1: From the scatterplot we can see that for
# homes with sqft>4000, the spread increases and the number
# of observations decreases.
# This suggests that our fit will worsen as we consider
# larger values of x beyond ~ 4000ft^2.
#Q2: Are the residuals normally distributed?
hist(reg.mod$residuals,breaks ="FD",col="skyblue2", freq = FALSE, ylab = "Density",
main = "Histogram of the Residuals")
lines(density(reg.mod$residuals),lwd = 2, col ="red")

[Figure: histogram of the residuals with a kernel density overlay.]
library(tseries)
jarque.bera.test(reg.mod$residuals)

##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 61385, df = 2, p-value < 2.2e-16
# We can also compute traditional metrics such as predicted
# values and elasticities
sqftx=c(2000, 4000, 6000) #given values for sqft
pricex=b1+b2*sqftx^2 #prices corresponding to given sqft
DpriceDsqft <- 2*b2*sqftx # marginal effect of sqft on price
elasticity=DpriceDsqft*sqftx/pricex
cat("DpriceDsqft (x =2000, 4000, 6000) =", DpriceDsqft)

## DpriceDsqft (x =2000, 4000, 6000) = 61.68521 123.3704 185.0556


cat("elasticity (x =2000, 4000, 6000) =", elasticity)

## elasticity (x =2000, 4000, 6000) = 1.050303 1.631251 1.817408

R Example
# Weekly Food Expenditure (y=food_exp in $) vs. Income (x=income in units of $100)
# We will consider a Linear-Log Model: y = beta_0 + beta_1 log(x)
#library(PoEdata)
data(food)
head(food)

## food_exp income

## 1 115.22 3.69
## 2 135.98 4.39
## 3 119.34 4.75
## 4 114.96 6.03
## 5 187.05 12.47
## 6 243.92 12.98
plot(food$income, food$food_exp, xlab="Income ($100)", ylab="Food Expenditure ($)",pch=20)
reg.mod <- lm(food$food_exp~log(food$income))
S(reg.mod)

## Call: lm(formula = food$food_exp ~ log(food$income))


##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -97.19 84.24 -1.154 0.256
## log(food$income) 132.17 28.80 4.588 4.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 91.57 on 38 degrees of freedom
## Multiple R-squared: 0.3565
## F-statistic: 21.05 on 1 and 38 DF, p-value: 4.76e-05
## AIC BIC
## 478.83 483.90
y=predict(reg.mod,interval="confidence")
matlines(food$income,y,lwd=2,col="blue")
[Figure: food expenditure vs. income with the fitted linear-log model and confidence bands.]
#Interpretation 1: For a household with $1000 weekly income,
# we expect an additional weekly food spending of $13.22
# for an additional $100 increase in weekly income.
#Interpretation 2: For a 1% increase in income,

# we expect food expenditure to increase by $1.32/week.
plot(food$income,reg.mod$residuals,pch=20, ylab="Residuals", xlab="Income")
abline(h=0, lwd=2, col="red")
[Figure: residuals vs. income for the linear-log model with a horizontal reference line at zero.]
#Q: Are the residuals normally distributed?
hist(reg.mod$residuals,breaks ="FD",col="skyblue2", freq = FALSE, ylab = "Density",
main = "Histogram of the Residuals",xlab="Residuals")
lines(density(reg.mod$residuals),lwd = 2, col ="red")

[Figure: histogram of the residuals with a kernel density overlay.]
library(tseries)
jarque.bera.test(reg.mod$residuals)

##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 0.19989, df = 2, p-value = 0.9049

4.1 Logarithms

In econometrics, logarithms are very useful when dealing with variables that span different orders
of magnitude yet must be included in the same regression model. For example, consider a model that includes
the predictors GDP (~10^10) and real interest rates (~10^−2). Logarithms can also help us mitigate violations of the
constant variance assumption often used in least squares. By default R uses natural logs when you use the
log function.
# Here we look at the Ornstein data which includes financial assets of 248 Canadian companies.
par(mfrow=c(1, 2), mar=c(5, 4, 6, 2) + 0.1)
densityPlot(~ assets, data=Ornstein, xlab="assets", main="Untransformed")
densityPlot(~ log10(assets), data=Ornstein, adjust=0.65,
xlab=expression(log[10]~"(assets)"), main="Log Plot")
basicPowerAxis(0, base=10, side="above", at=10^(2:5),
axis.title="")

[Figure: density plots of assets, untransformed and on the log10 scale.]


scatterplot(infantMortality ~ ppgdp, data=UN, xlab="GDP per Capita",
ylab="Infant Mortality Rate (per 1000 births)", main="Untransformed")

[Figure: infant mortality rate (per 1000 births) vs. GDP per capita, untransformed.]

scatterplot(infantMortality ~ ppgdp, data=UN, xlab="GDP per capita",


ylab="Infant Mortality Rate (per 1000 births)", main="Log-Log Plot",
log="xy", id=list(n=3))

[Figure: infant mortality rate vs. GDP per capita on log-log axes; Angola, Equatorial Guinea, and Gabon are labeled.]

## Angola Equatorial Guinea Gabon
## 4 54 62

4.2 Power Transformations

Power transformations of the predictors and/or response variable(s) can help improve their respective
distributions for modeling/estimation purposes (e.g., Least Squares estimates of regression coefficients) and
sampling/robustness (e.g., Bootstrapping and Bayesian inference). A popular transformation is the Box-Cox
scale power transformation given by:

T_BC(x, λ) = x^(λ) = (x^λ − 1)/λ   when λ ≠ 0;
T_BC(x, λ) = loge(x)               when λ = 0.
The R function symbox automatically produces boxplots of the variable to transform for different values of λ
to quickly gauge which power-law is more appropriate. For example, we can use this function on the Per
Person GDP (ppgdp) as follows:
library(car)
symbox(~ppgdp,data=UN)
[Figure: symbox boxplots of transformations of ppgdp for powers −1, −0.5, log, 0.5, 1.]
# Or we can look at income from the Prestige dataset
symbox(~ income, data=Prestige)

[Figure: symbox boxplots of transformations of income for powers −1, −0.5, log, 0.5, 1.]
Oftentimes our variables take on negative values (e.g., S&P 500 returns), for which transformations such as logs
do not work. Instead we can use the Yeo-Johnson transformation: T_BC(x + 1, λ) for nonnegative values
of x and T_BC(−x + 1, 2 − λ) for negative values of x.
# The function yjPower takes as inputs your variable (x) and lambda,
# and outputs the respective transformed variable.
x=-5:5
yjPower(x, 3)

## [1] -0.8333333 -0.8000000 -0.7500000 -0.6666667 -0.5000000 0.0000000


## [7] 2.3333333 8.6666667 21.0000000 41.3333333 71.6666667
For regression purposes, transformations for normality, linearity and/or constant variance can be easily
implemented and tested statistically with the powerTransform function. For example, consider the variable
infant mortality from the UN dataset.
hist(UN$infantMortality)

[Figure: histogram of UN$infantMortality; the distribution is right-skewed.]
# The histogram shows that the distribution is right-skewed. Therefore, we can try
# (1) test if a transformation is needed and (2) if it is needed, transform it.
p1 <- powerTransform(infantMortality ~ 1, data=UN, family="bcPower")
summary(p1)

## bcPower Transformation to Normality


## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 0.0468 0 -0.0879 0.1814
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 0.4644634 1 0.49555
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 172.8143 1 < 2.22e-16
testTransform(p1, lambda=1/2)

## LRT df pval
## LR test, lambda = (0.5) 41.95826 1 9.3243e-11
In the case of Multiple Regression, we may need to transform more than one variable at the same time
(each with its own power-law). For example, we can look at the multivariate Box-Cox transformation of the
Highway1 data. The following example is from the bcPower documentation:
# Multivariate Box Cox uses Highway1 data
# See: https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/Highway1
summary(powerTransform(cbind(len, adt, trks, sigs1) ~ 1, Highway1))

## bcPower Transformations to Multinormality

## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## len 0.1439 0 -0.2728 0.5606
## adt 0.0876 0 -0.1712 0.3464
## trks -0.6954 0 -1.9046 0.5139
## sigs1 -0.2654 0 -0.5575 0.0267
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0 0 0) 6.014218 4 0.19809
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1 1 1) 127.7221 4 < 2.22e-16
# Multivariate transformation to normality within levels of 'htype'
summary(a3 <- powerTransform(cbind(len, adt, trks, sigs1) ~ htype, Highway1))

## bcPower Transformations to Multinormality


## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## len 0.1451 0.00 -0.2733 0.5636
## adt 0.2396 0.33 0.0255 0.4536
## trks -0.7336 0.00 -1.9408 0.4735
## sigs1 -0.2959 -0.50 -0.5511 -0.0408
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0 0 0) 13.1339 4 0.01064
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1 1 1) 140.5853 4 < 2.22e-16
# test lambda = (0 0 0 -1)
testTransform(a3, c(0, 0, 0, -1))

## LRT df pval
## LR test, lambda = (0 0 0 -1) 31.12644 4 2.8849e-06
# save the rounded transformed values, plot them with a separate
# color for each highway type
transformedY <- bcPower(with(Highway1, cbind(len, adt, trks, sigs1)),
coef(a3, round=TRUE))
scatterplotMatrix( ~ transformedY|htype, Highway1)

[Figure: scatterplot matrix of the transformed variables (len.0, adt.0.33, trks.0, sigs1..0.5) by highway type (FAI, MA, MC, PA).]

4.3 Transforming Restricted-Range Variables

In regression models such as probit and logistic, where the response variable is a probability, the range is
restricted to the interval [0, 1]; this can present problems for such models because values can cluster
near the end points. Transformations that spread out the values, such as the arcsine square root, are
therefore good choices for these cases. This transformation is known for its variance-stabilizing properties and is
given by:

T_arcsine-sqrt(x) = sin^(−1)(√x)
par(mfrow=c(2,2))
hist(seq(0,1,length=100),main="Original", xlab="x")
hist(asin(sqrt(seq(0,1,length=100))),main="Transformed", xlab="x")
plot(seq(0,1,length=100),main="Original",type='l')
plot(asin(sqrt(seq(0,1,length=100))),main="Transformed", type='l')

[Figure: a uniform grid of values on [0, 1] before and after the arcsine square-root transformation, shown as histograms (top) and as sequences (bottom).]

Another popular transformation choice is the logit transformation:

T_logit(x) = logit(x) = ln( x / (1 − x) )
library(car)
attach(Prestige)
par(mfrow=c(1, 3))
densityPlot(~women,main="(a) Untransformed")
#densityPlot(~logit(women),main="(b) Logit",adjust=1)
densityPlot(~asin(sqrt(women/100)),main="(c) Arcsine square-root")

4.4 Transformations to Equalize Spread

Consider again the Ornstein dataset showing the number of interlocks by country.
Boxplot(interlocks ~ nation, data=Ornstein)

[Figure: parallel boxplots of interlocks by nation (CAN, OTH, UK, US), with several outlying firms labeled.]
## [1] "1" "2" "3" "5" "6" "9" "13" "27"
From the figure we can see that it would be helpful to even out the observed spread across countries. A
measure of the spread-level can easily be obtained with Tukey's spread-level plot, which also provides a
suggested power-law transformation to stabilize the variance:
spreadLevelPlot(interlocks+1 ~ nation,Ornstein)

[Figure: Tukey spread-level plot of interlocks + 1 by nation (hinge-spread vs. median).]
## LowerHinge Median UpperHinge Hinge-Spread
## US 2 6.0 13 11

## UK 4 9.0 14 10
## CAN 6 13.0 30 24
## OTH 4 15.5 24 20
##
## Suggested power transformation: 0.1534487
According to the output, the choice of λ = 0.15 would be an appropriate one. Since this value is close to 0,
which would correspond to a log function, we can try a log to see if it works:
Boxplot(log10(interlocks+1) ~ nation, data=Ornstein)
[Figure: parallel boxplots of log10(interlocks + 1) by nation.]

4.5 Transformations Toward Linearity

We can consider transforming both x and y in an effort to linearize the relationship between them. For
example, we can find the values of the positive exponents p and q such that the linear regression model
Y_i^q = β0 + β1 X_i^p + ε exhibits a more linear relationship. The choice of these exponents can be guided by
Mosteller & Tukey's Bulging Rule for linearization.

For example, below is a visualization of this rule taken from https://dzone.com/articles/tukey-and-mosteller's-bulging

fakedataMT=function(p=1,q=1,n=99,s=.1){
set.seed(1)
X=seq(1/(n+1),1-1/(n+1),length=n)
Y=(5+2*X^p+rnorm(n,sd=s))^(1/q)
return(data.frame(x=X,y=Y))}
par(mfrow=c(2,2))
plot(fakedataMT(p=.5,q=2),main="(p=1/2,q=2)")
plot(fakedataMT(p=3,q=-5),main="(p=3,q=-5)")
plot(fakedataMT(p=.5,q=-1),main="(p=1/2,q=-1)")
plot(fakedataMT(p=3,q=5),main="(p=3,q=5)")

[Figure: simulated (x, y) data for the four (p, q) combinations, illustrating the four directions of the bulging rule.]

In summary, when choosing a functional form:
• We should do our best to choose a functional form that:
– is consistent with economic theory,
– fits the data well, and
– is such that the assumptions of the regression model are satisfied.
• In real-world problems it is sometimes difficult to achieve all these goals.
• In applications of econometrics we must simply do the best we can to choose a satisfactory functional
form.

5 Regressions with Indicator Variables


Factor variables (also known as indicator or dummy variables) are used in regression to represent categorical
variables, i.e., non-quantitative characteristics such as gender, race, drink preference, etc. The simplest
example of a factor variable is a binary one, i.e., it can only take on two values (these are referred to as the
levels of the factor). For example,

UTOWN = 1 if the house is in University Town; 0 if the house is in Golden Oaks

PRICE = β1 + β2 UTOWN + e
• When an indicator variable is used in a regression, it is important to write the regression function for
the different values of the indicator variable.

E[PRICE] = β1 + β2, if UTOWN = 1;   β1, if UTOWN = 0

• In the simple regression model, an indicator variable on the right-hand side gives us a way to estimate
the differences between population means.
• It is straightforward to show that if x is an indicator variable, we have

b1 = ȳ0
b2 = ȳ1 − ȳ0

where ȳ0 and ȳ1 are the sample means of y for the observations with x = 0 and x = 1, respectively.

If you learned about the test for the difference of two means in an elementary statistics course, you will
recognize that the regression here includes that earlier topic as a special case.
Next, we will go over an example in R. In this case we have a binary indicator variable for gender, I, such that
I = 1 (gender = female) and I = 0 (gender = male). Our model is a regression of hourly wages (WAGE) on
gender (FEMALE = 1 if female). In this case our model can be expressed as:

E(WAGE) = β0 + β1 FEMALE = β0 + β1, if FEMALE = 1;   β0, if FEMALE = 0.

The respective model parameters are given by β0 = the mean wage for males and β1 = the difference between the
mean wage for females and the mean wage for males.

# Example 8: Wages (y=wage) vs. Gender (x=female)
library(POE5Rdata)
data("cps5_small")
attach(cps5_small)
# Note: When dealing with indicator variables it is important to check the number of
# observations in each category to see if the data are balanced or not.
S(cps5_small[,1:4])

## black educ exper faminc


## Min. :0.0000 Min. : 0.0 Min. : 0.00 Min. : 0
## 1st Qu.:0.0000 1st Qu.:12.0 1st Qu.:12.00 1st Qu.: 0
## Median :0.0000 Median :14.0 Median :24.00 Median : 23679
## Mean :0.0875 Mean :14.2 Mean :23.37 Mean : 35304
## 3rd Qu.:0.0000 3rd Qu.:16.0 3rd Qu.:34.00 3rd Qu.: 53029
## Max. :1.0000 Max. :21.0 Max. :62.00 Max. :469000
S(cps5_small[,5:9])

## female metro midwest south


## Min. :0.00 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.00 Median :1.0000 Median :0.0000 Median :0.000
## Mean :0.44 Mean :0.8217 Mean :0.2475 Mean :0.325
## 3rd Qu.:1.00 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.000
## Max. :1.00 Max. :1.0000 Max. :1.0000 Max. :1.000
## wage
## Min. : 3.94
## 1st Qu.: 13.00
## Median : 19.30
## Mean : 23.64
## 3rd Qu.: 29.80
## Max. :221.10
# Did you find any indicator variables that are not well balanced?

wage_female = wage[which(female==1)]
wage_male = wage[which(female==0)]

# First we will look at the distribution of wages conditioned on gender


par(mfrow=c(2,1))
hist(wage_female,breaks ="FD",col="skyblue2", freq = FALSE, ylab = "Density",
main = "Histogram of the Wages (Female)",xlab="Wages",ylim=c(0,0.06),xlim=c(0,60))
lines(density(wage_female),lwd = 2, col ="red")

hist(wage_male,breaks ="FD",col="skyblue2", freq = FALSE, ylab = "Density",


main = "Histogram of the Wages (Male)",xlab="Wages",ylim=c(0,0.06),xlim=c(0,60))
lines(density(wage_male),lwd = 2, col ="red")

[Figure: histograms of hourly wages for females (top) and males (bottom) with kernel density overlays.]
# Now we can estimate the model and interpret the results
reg.mod = lm(wage~female)
S(reg.mod)

## Call: lm(formula = wage ~ female)


##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.7624 0.5852 42.315 < 2e-16 ***
## female -2.5507 0.8822 -2.891 0.00391 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 15.17 on 1198 degrees of freedom
## Multiple R-squared: 0.006929
## F-statistic: 8.359 on 1 and 1198 DF, p-value: 0.003906
## AIC BIC
## 9935.83 9951.10
# We interpret the y-intercept as the mean hourly wage for men (beta_0 = $24.76), and
# the slope as the difference in hourly wages between women and men (beta_1 = -$2.55).
# On average, all other things being equal, women are predicted to earn $2.55/hr less
# than men.
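We can verify the difference-in-means interpretation directly from the group means computed above:
mean(wage_male)                        # equals the intercept, about 24.76
mean(wage_female) - mean(wage_male)    # equals the coefficient on female, about -2.55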
