Chapter 2
Contents
1 Univariate Characterizations
1.1 Statistical Summary
1.2 Histograms
1.3 Density Estimation
1.4 Quantile Plots
1.5 Boxplots
2 Bivariate Characterizations
2.1 Scatterplots
2.2 Jittering Scatterplots
1 Univariate Characterizations
In a regression setting we start by looking at a univariate characterization of the data. This entails
examining the statistical summaries, features, and distributions of all the variables we have.
Among the popular univariate graphical tools, we will consider histograms, density plots, Q-Q plots, and
boxplots.
1.1 Statistical Summary
Looking at the summary statistics of our data allows us to identify potential issues we may encounter when
building our regression model. For example, there may be NAs, values outside the expected range, etc. Below
are two examples using the pastecs and psych libraries.
library(pastecs)
# Summary for 1 variable:
stat.desc(rnorm(100),norm=TRUE)
library(psych)
library(car)      # provides the Prestige data set used below
describe(Prestige)
1.2 Histograms
Histograms should be one of our first analysis tools since they can help us learn about each variable’s properties.
Things to look for:
• Outliers: These can influence the regression fit in many ways
• Shape: Is the distribution symmetric, skewed, etc?
• Range: Is the range of values very large, e.g., spanning several orders of magnitude?
• Density of observations and/or clustering features: Are the data all clustered in certain regions? In high-density
regions your regression fit may be robust, but it can become unstable in sparser ones.
A near-optimal number of bins k, based on the number of observations n, is given by Sturges’ rule:

k = 1 + log2(n)

However, more sample-specific rules exist, such as the Freedman–Diaconis (FD) rule, which suggests

k = n^(1/3) (max − min) / (2 (Q3 − Q1))

bins, where Q1 and Q3 are the first and third quartiles.
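As a quick numerical check (a minimal sketch, not part of the original notes), we can compute the FD bin count directly from the formula and compare it with base R’s built-in helper:

# FD bin count computed by hand vs. R's built-in helper nclass.FD()
x <- rnorm(1000)
k_fd <- ceiling((max(x) - min(x)) * length(x)^(1/3) / (2 * IQR(x)))
k_fd
nclass.FD(x)   # should be very close to k_fd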
For example, we can compare the two methods for 2 different data sets:
#Compare histograms using k vs FD:
#quartz()
par(mfrow=c(2,2))
n = 1000;k = 1 + log2(n)
hist(rnorm(n),breaks = k,col="skyblue4", main = "Normal Distr. Using k ")
hist(rnorm(n),breaks = "FD",col="skyblue4", main = "Normal Distr. Using FD")
n = 1000;k = 1 + log2(n)
hist(rexp(n,0.5),breaks = k,col="skyblue2", main ="Exp Distr. Using k")
hist(rexp(n,0.5),breaks = "FD",col="skyblue2", main ="Exp Distr. Using FD")
(Figure: histograms of the normal and exponential samples, using k = 1 + log2(n) versus the FD rule for the number of bins.)
1.3 Density Estimation
Histograms are very informative, but visualizing the density estimate along with the histogram and a
1D scatterplot (rug plot) can be even more informative. A commonly used (nonparametric) kernel density
estimate is given by:
p̂(x) = (1/(nh)) ∑_{i=1}^{n} K((x − xi)/h)
where K is a kernel function (e.g., the standard normal density), h is the bandwidth (amount of smoothing), and n is the
number of observations.
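To make the formula concrete, here is a small sketch (not part of the original notes) that evaluates the estimate at a single point by hand, using a Gaussian kernel and the same default bandwidth as density():

set.seed(1)
x <- rnorm(50)
h <- bw.nrd0(x)                           # default bandwidth rule used by density()
x0 <- 0                                   # point at which to evaluate the estimate
p_hat <- mean(dnorm((x0 - x) / h)) / h    # (1/(n*h)) * sum K((x0 - x_i)/h)
p_hat
approx(density(x, bw = h), xout = x0)$y   # density() interpolated at x0, for comparison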
#quartz()
n = 1000
x = rexp(n, 0.5)   # draw the sample once so the histogram, densities, and rug all use the same data
hist(x, breaks = "FD", col = "skyblue2", freq = FALSE, ylab = "Density")
lines(density(x), lwd = 2, col = "red")
lines(density(x, adjust = 0.5), lwd = 2, col = "blue", lty = 2)
rug(x)
(Figure: FD histogram of the exponential sample with two kernel density estimates (default and adjust = 0.5 bandwidths) and a rug plot.)
# You can also shade the histogram by count with ggplot2
library(ggplot2)
df = data.frame(value = rnorm(10000))
ggplot(df, aes(x = value)) + geom_histogram(binwidth = 0.2, aes(fill = ..count..))
(Figure: ggplot2 histogram of the simulated normal values, with bars shaded by count.)
1.4 Quantile Plots
Quantile–quantile (Q–Q) plots are very useful for comparing the distribution of a variable to a theoretical reference distribution.
By default the normal distribution is used in the function qqPlot from the car package, but we can provide
any other distribution.
#The dataset Prestige from the car package has 102 observations
library(car)
attach(Prestige)
qqPlot(~ income, data=Prestige, id=list(n=3))
(Figure: normal Q–Q plot of income with a pointwise confidence envelope; general.managers, physicians, and lawyers are flagged.)
## general.managers physicians lawyers
## 2 24 17
qqPlot(income ~ type, data=Prestige, layout=c(1, 3))
(Figure: normal Q–Q plots of income within each occupation type (bc, prof, wc); general.managers, physicians, firefighters, commercial.travellers, insurance.agents, and farm.workers are among the labeled points.)
qqPlot(lm(prestige ~ income + education + type, data=Duncan),
envelope=.99)
(Figure: Q–Q plot of the studentized residuals from lm(prestige ~ income + education + type, data = Duncan) against t quantiles with a 99% envelope; minister and machinist stand out.)
## minister machinist
## 6 28
1.5 Boxplots
Boxplots let us read off quantitative aspects of a variable’s distribution such as the min, max, Q1,
Q3, IQR, and median. Another helpful property is that they are very convenient when comparing many
distributions simultaneously and/or conditioning on other variables.
#Boxplot of Income
Boxplot(~income, data=Prestige)
(Figure: boxplot of income; general.managers, physicians, lawyers, osteopaths.chiropractors, and veterinarians are flagged as outliers.)
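The by-nation comparison shown next can be produced with a conditional Boxplot call along these lines (a sketch; the exact call is an assumption based on the axis labels in the figure):

library(car)
# Boxplots of interlocks conditioned on nation of control; outlying firms are labeled
Boxplot(Ornstein$interlocks ~ Ornstein$nation)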
(Figure: boxplots of Ornstein$interlocks by Ornstein$nation (CAN, OTH, UK, US), with outlying firms labeled.)
## [1] "1" "2" "3" "5" "6" "9" "13" "27"
# Note: interlocks = Number of interlocking director and executive positions
# shared with other major firms.
library("plotrix")
means <- Tapply(interlocks ~ nation, mean, data=Ornstein)
sds <- Tapply(interlocks ~ nation, sd, data=Ornstein)
plotCI(1:4, means, sds, xaxt="n", xlab="Nation of Control",
ylab="interlocks", ylim=range(Ornstein$interlocks))
lines(1:4, means)
axis(1, at=1:4, labels = names(means))
(Figure: mean number of interlocks by nation of control with ±1 standard deviation error bars.)
2 Bivariate Characterizations
Once we have the univariate description of the variables, we can examine the pairwise associations via,
e.g., scatterplots.
2.1 Scatterplots
library(car)
attach(Prestige)
scatterplot(prestige ~income, lwd=3, id=TRUE)
Note: when looking at these graphs, look for the fitted line and the shape of the point cloud. Here the relationship shows a positive trend; too much discrepancy between the line and the shape of the plot suggests that the model is not a good fit.

(Figure: scatterplot of prestige vs. income with marginal boxplots, a linear fit, and a lowess smoother; observations 2 and 24 are flagged.)
## [1] 2 24
#Note: Lowess smoother is a 'locally weighted smoother'
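The grouped plot shown next conditions the scatterplot on occupation type; a call of roughly this form would produce it (a sketch, the exact options are assumptions):

# prestige vs. income with a separate fit/smoother for each occupation type
scatterplot(prestige ~ income | type, data = Prestige, id = TRUE)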
(Figure: scatterplot of prestige vs. income grouped by type (bc, prof, wc), with separate smoothers and labeled outlying observations.)
## 31 58 51 82
## 1 2 3 19 24 25
#If we have several predictors in our model, it is more practical to
# simply look at a scatterplot matrix of the data. For example:
# scatterplotMatrix(~ prestige + income + education + women)
(Figure: scatterplot matrix of prestige, income, education, and women.)
2.2 Jittering Scatterplots

(Figure: scatterplot of Vocab$vocabulary vs. Vocab$education; both variables are discrete, so many observations overplot.)
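The overplotted scatterplot above comes from a plain plot call; a minimal sketch (assuming the Vocab data from the car/carData package) is:

library(car)   # Vocab data
plot(Vocab$vocabulary ~ Vocab$education)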
#Add some random noise to the variables --> jitter the data
plot(jitter(Vocab$vocabulary)~jitter(Vocab$education))
(Figure: jittered scatterplot of vocabulary vs. education.)
# Increase the amount of jitter
plot(jitter(Vocab$vocabulary,factor =2)~jitter(Vocab$education, factor =2),col="gray",cex=0.5)
abline(lm(Vocab$vocabulary~ Vocab$education),lwd= 3, lty="dashed")
lines(lowess(Vocab$education,Vocab$vocabulary,f=0.2), lwd =3, col="red")
(Figure: heavily jittered scatterplot of vocabulary vs. education with the least-squares line (dashed) and a lowess smoother.)
Table 1: Terminology

y                    x
Dependent Variable   Independent Variable
Explained Variable   Explanatory Variable
Response Variable    Control Variable
Predicted Variable   Predictor Variable
Regressand           Regressor
The econometric model is the motivation; for example, it could be a supply equation, and therefore it comes from economic
theory. The parameters of the model are also known as the population parameters. For example, let us
examine the relationship between weekly food expenditure and weekly household income.
• A household spends $80 plus 10 cents of each dollar of income received on food.
• Algebraically the rule is y = 80 + 0.10x where y = weekly household food expenditure ($) and x =
weekly household income ($).
• In reality, many factors may affect household expenditure on food.
• Let e (error term) = everything else affecting food other than income.
• y = β 1 + β 2 x + e. This equation is the simple regression model.
• A simple linear regression analysis examines the relationship between a y-variable and one x-variable.
R Example
library(POE5Rdata)
# Histograms
data("food")
hist(food$income,col="skyblue3", main = "PDF of Weekly Income = f(x)")
(Figure: histogram “PDF of Weekly Income = f(x)” of food$income.)
hist(food$food_exp, col="skyblue3", main = "PDF of Weekly Food Expenditure = f(y|x)")
(Figure: histogram of food$food_exp.)
hist(food$food_exp[food$income<=20],col="skyblue3", main = "PDF of f(y|x<2000)")
(Figure: histogram “PDF of f(y|x<2000)” of food expenditure for households with weekly income at or below $2000.)
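The second conditional histogram restricts attention to incomes above $2000 (income > 20, since income is measured in $100s); a sketch of such a chunk is:

hist(food$food_exp[food$income>20],col="skyblue3", main = "PDF of f(y|x>2000)")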
(Figure: histogram “PDF of f(y|x>2000)” of food expenditure for households with weekly income above $2000.)
# Residuals vs. income for the fitted model food_exp ~ income
ols.mod <- lm(food_exp~income, data=food)
plot(food$income, ols.mod$residuals, pch=20, ylab="Residuals", xlab="Weekly Income ($100s)")
abline(h=0,col="red", lwd=2)

(Figure: residuals vs. weekly income with a horizontal reference line at zero; the spread increases with income.)

# Note: From the residuals plot we can see that the variance is not constant
yi = β1 + β2 xi + ei

In R (and most other software packages) the default null hypothesis tested is H0 : βk = 0 against the
alternative H1 : βk ≠ 0, k = 1, 2. Therefore, if the respective estimated p-value is small (p-value < α), we
reject H0 and conclude that the respective parameter is statistically significantly different from 0.
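For instance, with the food expenditure model used throughout this chapter, these p-values can be read directly off the coefficient table (a minimal sketch):

library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp ~ income, data = food)
summary(ols.mod)$coefficients[, "Pr(>|t|)"]   # p-values for H0: beta_k = 0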
The Simple Linear Regression (SR) assumptions of the model are that:
1. Linearity: The value of y, for each value of x, is y = β1 + β2 x + e
2. E[ei | xi] = 0, which is the same as E[y | x] = β1 + β2 x
3. Homoskedasticity: Var[ei | xi] = Var[y | x] = σ² (constant variance condition, i.e., the errors are homoskedastic)
4. cov(yi, yj) = cov(ei, ej | x) = 0 (even better if the errors are independent, which is a stronger condition)
5. Rank Condition: x is not random and takes on at least two different values
6. Normality (optional): The errors are normally distributed, i.e., ei | xi ∼ N(0, σ²)
Note: Since the data {(x1, y1), . . . , (xn, yn)} are a random sample, the pairs (xi, yi) are independent and identically
distributed (iid).
3.1.2 The Random Error and Strict Exogeneity
• The second assumption of the simple regression model concerns the “everything else” term e.
• Unlike food expenditure and income, the random error term e is not observable.
• The x-variable, income, cannot be used to predict the value of e.
• E [ei | xi ] = 0 has two implications:
– E [ ei ] = 0
– cov (ei , xi ) = 0
To better understand the error term we can think of decomposing the observation y into 2 components:
1. E[y| x ] = β 1 + β 2 x, which represents the systematic component of y, and
2. e = y − E[y| x ] = y − β 1 − β 2 x, which represents the random component and thus, encapsulates
everything else that contributes to the variations in y. Therefore,
y = β1 + β2 x + e

where y is the data, β1 + β2 x is the model, and e is the uncertainty.
R Example
library(POE5Rdata)
data("food")
# Test the covariance assumption
ols.mod <- lm(food_exp~income, data=food)
cov(food$income, ols.mod$residuals) #Seems close enough to 0
## [1] -1.166019e-14
# Test E[e] = 0
mean(ols.mod$residuals) #Seems close enough to 0
## [1] -4.973799e-15
• The fitted regression line is

ŷi = b1 + b2 xi

and the residual is

êi = yi − ŷi = yi − b1 − b2 xi

• The least squares estimates satisfy the normal equations

0 = ∑i (yi − b1 − b2 xi) = ∑i êi
0 = ∑i xi (yi − b1 − b2 xi) = ∑i xi êi

and solving these gives

b2 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²
b1 = ȳ − b2 x̄

the ordinary least squares estimators. “Ordinary least squares” is abbreviated as OLS.
• It can be shown that

b2 = ∑_{i=1}^{N} wi yi,  where  wi = (xi − x̄) / ∑(xi − x̄)²
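These formulas are easy to verify numerically with the food data (a quick sketch); the results should match the lm() coefficients reported later in this chapter (about 83.42 and 10.21):

library(POE5Rdata)
data("food")
x <- food$income; y <- food$food_exp
b2 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1 <- mean(y) - b2 * mean(x)
w  <- (x - mean(x)) / sum((x - mean(x))^2)      # the weights w_i
c(b1 = b1, b2 = b2, b2.from.weights = sum(w * y))
coef(lm(food_exp ~ income, data = food))        # should agree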
E[b2] = β2

i.e., b2 is unbiased.

• It can further be shown that

Var(b2) = σ² / ∑(xi − x̄)²
Var(b1) = σ² ∑xi² / (N ∑(xi − x̄)²)
cov(b1, b2) = −σ² x̄ / (N ∑(xi − x̄)²)
• Note that
1. The larger the variance term, the greater the uncertainty there is in the statistical model, and the
larger the variances and covariance of the least squares estimators.
2. The larger the sum of squares ∑ ( xi − x )2 , the smaller the variances of the least squares estimators
and the more precisely we can estimate the unknown parameters.
3. The larger the sample size N, the smaller the variances and covariance of the least squares
estimators.
If the homoskedasticity assumption is violated, so that Var[ei | xi] depends on xi, then the random errors are said to be
heteroskedastic.
R Example
library(POE5Rdata)
data("food")
#A plot of the estimate of the errors, i.e., the residuals, shows that they are heteroskedastic.
ols.mod <- lm(food_exp~income, data=food)
plot(food$income, ols.mod$residuals, pch=20, ylab="Residuals", xlab="Weekly Income ($100s)")
abline(h=0,col="red", lwd=2)
(Figure: residuals vs. weekly income with a horizontal reference line at zero.)
# Note: From the residuals plot we can see that the variance is not constant
3.1.5 Variation in x
3.1.6 Error Normality
• It is not at all necessary for the random errors to be conditionally normal in order for regression analysis
to “work”.
• When samples are small, it is advantageous for statistical inferences that the random errors, and
dependent variable y, given each x-value, are normally distributed.
• Central Limit Theorem says roughly that collections of many random factors tend toward having a
normal distribution.
• It is possible that there are correlations between the random error terms.
• With cross-sectional data, data collected at one point in time, there may be a lack of statistical indepen-
dence between random errors for individuals who are spatially connected.
• Within a larger sample of data, there may be clusters of observations with correlated errors because of
the spatial component.
• The starting point in regression analysis is to assume that there is no error correlation.
• We can use the sample data values of yi and xi from the food data set, to estimate the unknown
regression parameters β 1 and β 2 .
• These parameters represent the unknown intercept and slope coefficients for the food expenditure–
income relationship.
• If we represent the 40 data points as (yi , xi ), i = 1, . . . , 40, and plot them, we obtain the scatter diagram
below.
• Our problem is to estimate the location of the mean expenditure line.
• We would expect this line to be somewhere in the middle of all the data points.
R Example
library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp~income, data=food)
plot(food$income,food$food_exp, pch=20,
xlab="Weekly Income ($100s)", ylab="Weekly Food Expenditure ($)")
abline(ols.mod,lwd=2,col="blue")
legend(6, 550, legend="Linear Relationship Model: Mean Expenditure Line",col=c("blue"),lty=1, cex=0.8)
(Figure: weekly food expenditure vs. weekly income with the fitted least-squares line.)
R Example
library(POE5Rdata)
data("food")
# Fit the simple regression of food expenditure on income and summarize it
ols.mod <- lm(food_exp ~ income, data = food)
summary(ols.mod)
##
## Call:
## lm(formula = food_exp ~ income, data = food)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.025 -50.816 -6.324 67.879 212.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.416 43.410 1.922 0.0622 .
## income 10.210 2.093 4.877 1.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89.52 on 38 degrees of freedom
## Multiple R-squared: 0.385, Adjusted R-squared: 0.3688
## F-statistic: 23.79 on 1 and 38 DF, p-value: 1.946e-05
names(ols.mod)
ols.mod$coefficients[1]
## (Intercept)
## 83.416
ols.mod$coefficients[2]
## income
## 10.20964
# If you just want the values: ols.mod$coefficients[[1]], ols.mod$coefficients[[2]]
• Under our assumptions of the linear regression model, the estimators b1 and b2 have the smallest
variance of all linear and unbiased estimators of β1 and β2. They are the Best Linear Unbiased Estimators
(BLUE) of β1 and β2.
• The estimators b1 and b2 are best within their class because they have the minimum variance. When
comparing two linear and unbiased estimators, we always want to use the one with the smaller
variance, because that estimation rule gives us the higher probability of obtaining an estimate that is
close to the true parameter value.
• For this to be the case, all of our assumptions should be true, except for the normality. For example, we
need homoskedasticity.
• A (graduate level version of the) central limit theorem says that the above is approximately correct
even if the errors are not normally distributed, as long as N is sufficiently large.
• We can estimate σ² by

σ̂² = ∑_{i=1}^{N} êi² / (N − 2)

where

êi = yi − ŷi = yi − b1 − b2 xi

• It can be shown that σ̂² is unbiased for σ².
• Note that the denominator is N − 2, not N.
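As a quick check (a sketch), the by-hand estimate matches what summary() reports as the residual standard error (about 89.52 for the food data):

library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp ~ income, data = food)
N <- nrow(food)
sigma2.hat <- sum(residuals(ols.mod)^2) / (N - 2)   # sum of squared residuals over N - 2
c(by.hand = sqrt(sigma2.hat), from.summary = summary(ols.mod)$sigma)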
• We can then construct the estimated variances and covariance by replacing σ² with σ̂²:

V̂ar(b2) = σ̂² / ∑(xi − x̄)²
V̂ar(b1) = σ̂² ∑xi² / (N ∑(xi − x̄)²)
ĉov(b1, b2) = −σ̂² x̄ / (N ∑(xi − x̄)²)

• We often write

se(b2) = √V̂ar(b2)
se(b1) = √V̂ar(b1)
R Example
library(POE5Rdata)
data("food")
ols.mod <- lm(food_exp ~ income, data = food)
vcov(ols.mod)   # estimated variance-covariance matrix of (b1, b2)
## (Intercept) income
## (Intercept) 1884.44226 -85.903157
## income -85.90316 4.381752
# The standard errors are the square roots of the diagonal: sqrt(diag(vcov(ols.mod)))
In general, for any regression model estimated, it is a good idea to also include bootstrap estimates as a way to
test the robustness of the estimates. The core principle is simple random sampling with replacement
from the original data set; for each bootstrap sample we estimate the same regression model and then collect the
usual statistics.
R Example
library(boot)
library(car)
set.seed(3435)
library(POE5Rdata)
data("food")
trans.mod = lm(food_exp~income, data=food)
betahat.boot = Boot(trans.mod, R=999)
usualEsts = summary(trans.mod)$coef[, 1:2]
summary(betahat.boot)
##
## Number of bootstrap replications R = 999
## original bootBias bootSE bootMed
## (Intercept) 83.416 -0.487805 31.391 81.858
## income 10.210 0.041439 1.974 10.241
confint(betahat.boot)
(Figure: bootstrap sampling distributions of the intercept and income coefficients.)
# Bootstrap for the estimated residual standard deviation:
sigmahat.boot <- Boot(trans.mod, R=199, f=sigmaHat, labels="sigmaHat")
summary(sigmahat.boot)
## 2.5 % 97.5 %
## sigmaHat 77.43262 109.2678
R Example
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(summary(fit)$r.square)
}
library(POE5Rdata)
data("food")
results <- boot(data=food, statistic=rsq, R=1000, formula=food_exp~income)
ci1=boot.ci(results, type="bca", index=1,conf=0.95)
ci_rsq = ci1$bca[ , c(4, 5)]
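To visualize the bootstrap distribution of R-squared together with the BCa limits computed above, one option is (a sketch):

hist(results$t, breaks = "FD", col = "skyblue3",
     main = "Bootstrap Distribution of R-Squared", xlab = "R-Squared")
abline(v = ci_rsq, col = "red", lwd = 2, lty = 2)   # BCa 95% confidence limits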
(Figure: bootstrap distribution of R-squared.)
4 Estimating Nonlinear Relationships
When the data are not suitable for a linear model, we can consider nonlinear transformations of y and/or x.
In econometrics, four very popular nonlinear models are the (1) log-log, (2) log-linear, (3) linear-log, and (4)
quadratic models. The table below summarizes these models together with their slopes and elasticities.
Recall that the elasticity, ε, can be computed as ε = slope × x/y.

Model        Equation                  Slope (dy/dx)   Elasticity
Quadratic    y = β1 + β2 x²            2 β2 x          2 β2 x (x/y)
Log-log      ln(y) = β1 + β2 ln(x)     β2 (y/x)        β2
Log-linear   ln(y) = β1 + β2 x         β2 y            β2 x
Linear-log   y = β1 + β2 ln(x)         β2 / x          β2 / y

For example, a quadratic model for house prices such as
PRICE = α1 + α2 SQFT² + e

has the estimated marginal effect (slope)

dP̂RICE/dSQFT = 2 α̂2 SQFT
R Example
library(POE5Rdata)
# Home Prices (y=price) vs. Living Area (x=sqft)
data(br)
reg.mod <- lm(price~I(sqft^2), data=br)
S(reg.mod)
## Call: lm(formula = price ~ I(sqft^2), data = br)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.578e+04 2.890e+03 19.30 <2e-16 ***
## I(sqft^2) 1.542e-02 3.131e-04 49.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard deviation: 68210 on 1078 degrees of freedom
## Multiple R-squared: 0.6923
## F-statistic: 2426 on 1 and 1078 DF, p-value: < 2.2e-16
## AIC BIC
## 27110.35 27125.30
b1 <- coef(reg.mod)[[1]]
b2 <- coef(reg.mod)[[2]]
plot(br$sqft, br$price, xlab="Area (ft^2)", ylab="Home Price ($)",pch=20)
curve(b1+b2*x^2, col="red", add=TRUE,lwd=2)
(Figure: home price vs. area with the fitted quadratic curve.)
# Q1: Is the quadratic fit a good one?
# A1: From the scatterplot we can see that for
# homes with sqft>4000, the spread increases and the number
# of observations decreases.
# This suggests that our fit will worsen as we consider
# larger values of x beyond ~ 4000 ft^2.
#Q2: Are the residuals normally distributed?
hist(reg.mod$residuals,breaks ="FD",col="skyblue2", freq = FALSE, ylab = "Density",
main = "Histogram of the Residuals")
lines(density(reg.mod$residuals),lwd = 2, col ="red")
(Figure: histogram and kernel density estimate of the residuals from the quadratic model.)
library(tseries)
jarque.bera.test(reg.mod$residuals)
##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 61385, df = 2, p-value < 2.2e-16
# We can also compute traditional metrics such as predicted
# values and elasticities
sqftx=c(2000, 4000, 6000) #given values for sqft
pricex=b1+b2*sqftx^2 #prices corresponding to given sqft
DpriceDsqft <- 2*b2*sqftx # marginal effect of sqft on price
elasticity=DpriceDsqft*sqftx/pricex
cat("DpriceDsqft (x =2000, 4000, 6000) =", DpriceDsqft)
R Example
# Weekly Food Expenditure (y=food_exp in $) vs. Income (x=income in units of $100)
# We will consider a Linear-Log Model: y = beta_0 + beta_1 log(x)
#library(PoEdata)
data(food)
head(food)
## food_exp income
## 1 115.22 3.69
## 2 135.98 4.39
## 3 119.34 4.75
## 4 114.96 6.03
## 5 187.05 12.47
## 6 243.92 12.98
plot(food$income, food$food_exp, xlab="Income ($100)", ylab="Food Expenditure ($)",pch=20)
reg.mod <- lm(food$food_exp~log(food$income))
S(reg.mod)
(Figure: food expenditure vs. income scatterplot for the linear-log model.)
#Interpretation 1: For a household with $1000 weekly income,
# we expect an additional weekly food spending of $13.22
# for an additional $100 increase in weekly income.
#Interpretation 2: For a 1% increase in income,
# we expect food expenditure to increase by $1.32/week.
plot(food$income,reg.mod$residuals,pch=20, ylab="Residuals", xlab="Income")
abline(h=0, lwd=2, col="red")
(Figure: residuals from the linear-log model vs. income, with a horizontal reference line at zero.)
#Q: Are the residuals normally distributed?
hist(reg.mod$residuals,breaks ="FD",col="skyblue2", freq = FALSE, ylab = "Density",
main = "Histogram of the Residuals",xlab="Residuals")
lines(density(reg.mod$residuals),lwd = 2, col ="red")
(Figure: histogram and kernel density estimate of the residuals from the linear-log model.)
library(tseries)
jarque.bera.test(reg.mod$residuals)
##
## Jarque Bera Test
##
## data: reg.mod$residuals
## X-squared = 0.19989, df = 2, p-value = 0.9049
4.1 Logarithms
In econometrics, logarithms are very useful when dealing with variables that span different orders
of magnitude yet must be included in the same regression model. For example, consider a model that includes
the predictors GDP (on the order of 10^10) and real interest rates (on the order of 10^-2). Logarithms can also help us mitigate violations of the
constant variance assumption often used in least squares. By default, R uses natural logs when you use the
log function.
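A one-line check of the defaults (a trivial sketch):

log(100)              # natural log (base e)
log10(100)            # base-10 log
log(100, base = 10)   # equivalent to log10()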
# Here we look at the Ornstein data which includes financial assets of 248 Canadian companies.
par(mfrow=c(1, 2), mar=c(5, 4, 6, 2) + 0.1)
densityPlot(~ assets, data=Ornstein, xlab="assets", main="Untransformed")
densityPlot(~ log10(assets), data=Ornstein, adjust=0.65,
xlab=expression(log[10]~"(assets)"), main="Log Plot")
basicPowerAxis(0, base=10, side="above", at=10^(2:5),
axis.title="")
(Figure: density plots of assets, untransformed and on the log10 scale.)
(Figure: infant mortality rate (per 1000 births) vs. per-person GDP for the UN data, untransformed and on a log-log scale; Angola, Equatorial Guinea, and Gabon are flagged.)
## Angola Equatorial Guinea Gabon
## 4 54 62
Power transformations of the predictors and/or response variable(s) can help improve their respective
distributions for modeling/estimation purposes (e.g., Least Squares estimates of regression coefficients) and
sampling/robustness (e.g., Bootstrapping and Bayesian inference). A popular transformation is the Box-Cox
scale power transformation given by:
T_BC(x, λ) = x^(λ) = (x^λ − 1)/λ   when λ ≠ 0;
T_BC(x, λ) = loge(x)               when λ = 0.
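The car function bcPower implements this transformation; as a quick sketch (the value λ = 0.25 below is an arbitrary illustration), we can compare it against the formula applied by hand:

library(car)
x <- Prestige$income
lambda <- 0.25
head(cbind(by.hand = (x^lambda - 1) / lambda, bcPower = bcPower(x, lambda)))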
The R function symbox automatically produces boxplots of the variable to transform for different values of λ
to quickly gauge which power-law is more appropriate. For example, we can use this function on the Per
Person GDP (ppgdp) as follows:
library(car)
symbox(~ppgdp,data=UN)
(Figure: symbox boxplots of ppgdp for a range of powers.)
# Or we can look at income from the Prestige dataset
symbox(~ income, data=Prestige)
(Figure: symbox boxplots of income for a range of powers.)
Oftentimes our variables take on negative values (e.g., S&P 500 returns), for which transformations such as logs
do not work. Instead we can use the Yeo-Johnson transformation, TBC(x + 1, λ) for nonnegative values
of x and TBC(−x + 1, 2 − λ) for negative values of x.
# The function yjPower takes as inputs your variable (x) and lambda,
# and outputs the respective transformed variable.
x=-5:5
yjPower(x, 3)
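The next figure is a plain histogram of the UN infant mortality rates; a sketch of a chunk that would produce it (assuming the UN data from car/carData):

library(car)   # the UN data set
hist(UN$infantMortality, col = "skyblue3")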
(Figure: histogram of UN$infantMortality; the distribution is right-skewed.)
# The histogram shows that the distribution is right-skewed. Therefore, we can try
# (1) test if a transformation is needed and (2) if it is needed, transform it.
p1 <- powerTransform(infantMortality ~ 1, data=UN, family="bcPower")
summary(p1)
## LRT df pval
## LR test, lambda = (0.5) 41.95826 1 9.3243e-11
In the case of Multiple Regression, we may need to transform more than one variable at the same time
(each with its own power-law). For example, we can look at the Multivariate Box Cox Highway1 data. The
following example is from the bcPower documentation:
# Multivariate Box Cox uses Highway1 data
# See: https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/Highway1
summary(powerTransform(cbind(len, adt, trks, sigs1) ~ 1, Highway1))
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## len 0.1439 0 -0.2728 0.5606
## adt 0.0876 0 -0.1712 0.3464
## trks -0.6954 0 -1.9046 0.5139
## sigs1 -0.2654 0 -0.5575 0.0267
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0 0 0) 6.014218 4 0.19809
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1 1 1) 127.7221 4 < 2.22e-16
# Multivariate transformation to normality within levels of 'htype'
summary(a3 <- powerTransform(cbind(len, adt, trks, sigs1) ~ htype, Highway1))
## LRT df pval
## LR test, lambda = (0 0 0 -1) 31.12644 4 2.8849e-06
# save the rounded transformed values, plot them with a separate
# color for each highway type
transformedY <- bcPower(with(Highway1, cbind(len, adt, trks, sigs1)),
coef(a3, round=TRUE))
scatterplotMatrix( ~ transformedY|htype, Highway1)
(Figure: scatterplot matrix of the transformed variables (len.0, adt.0.33, trks.0, sigs1..0.5) by highway type.)
In regression models such as probit and logistic, where the response variable is a probability, the range is
restricted to the interval [0, 1]. This can present problems for such models because values can cluster
near the end points. Therefore, transformations that spread out the values, such as the arcsine square root, are
good choices for these cases. This transformation is known for its variance-stabilizing properties and is given
by:

T_asinsqrt(x) = sin⁻¹(√x)
par(mfrow=c(2,2))
hist(seq(0,1,length=100),main="Original", xlab="x")
hist(asin(sqrt(seq(0,1,length=100))),main="Transformed", xlab="x")
plot(seq(0,1,length=100),main="Original",type='l')
plot(asin(sqrt(seq(0,1,length=100))),main="Transformed", type='l')
(Figure: histograms and line plots of an equally spaced grid on [0, 1], original and after the arcsine square root transformation.)
Consider again the Ornstein dataset showing the number of interlocks by country.
Boxplot(interlocks ~ nation, data=Ornstein)
(Figure: boxplots of interlocks by nation, with outlying firms labeled.)
## [1] "1" "2" "3" "5" "6" "9" "13" "27"
From the figure we can see that it would be helpful to even out the observed spread across countries. A
measure of the spread level can easily be obtained with Tukey’s spread-level plot, which also provides a
suggested power transformation to stabilize the variance:
spreadLevelPlot(interlocks+1 ~ nation,Ornstein)
(Figure: Tukey spread-level plot of interlocks + 1 by nation (hinge-spread vs. median).)
## LowerHinge Median UpperHinge Hinge-Spread
## US 2 6.0 13 11
## UK 4 9.0 14 10
## CAN 6 13.0 30 24
## OTH 4 15.5 24 20
##
## Suggested power transformation: 0.1534487
According to the output, λ = 0.15 would be an appropriate choice. Since this value is close to 0,
which corresponds to a log transformation, we can try a log to see if it works:
Boxplot(log10(interlocks+1) ~ nation, data=Ornstein)
(Figure: boxplots of log10(interlocks + 1) by nation.)
We can consider transforming both x and y in an effort to linearize the relationship between them. For
example, we can find values of the positive exponents p and q such that the linear regression model
Yi^q = β0 + β1 Xi^p + ε exhibits a more linear relationship. The choice of these exponents can be guided by
Mosteller & Tukey’s Bulging Rule for linearization, as shown in the figure below.
fakedataMT=function(p=1,q=1,n=99,s=.1){
set.seed(1)
X=seq(1/(n+1),1-1/(n+1),length=n)
Y=(5+2*X^p+rnorm(n,sd=s))^(1/q)
return(data.frame(x=X,y=Y))}
par(mfrow=c(2,2))
plot(fakedataMT(p=.5,q=2),main="(p=1/2,q=2)")
plot(fakedataMT(p=3,q=-5),main="(p=3,q=-5)")
plot(fakedataMT(p=.5,q=-1),main="(p=1/2,q=-1)")
plot(fakedataMT(p=3,q=5),main="(p=3,q=5)")
(Figure: simulated (x, y) data for the four (p, q) combinations, illustrating the bulging rule.)
In summary, when choosing a functional form:
• We should do our best to choose a functional form that:
– is consistent with economic theory,
– fits the data well, and
– is such that the assumptions of the regression model are satisfied.
• In real-world problems it is sometimes difficult to achieve all these goals.
• In applications of econometrics we must simply do the best we can to choose a satisfactory functional
form.
PRICE = β1 + β2 UTOWN + e

• When an indicator variable is used in a regression, it is important to write the regression function for
the different values of the indicator variable:

E[PRICE] = β1 + β2, if UTOWN = 1
E[PRICE] = β1,      if UTOWN = 0
• In the simple regression model, an indicator variable on the right-hand side gives us a way to estimate
the differences between population means.
• It is straightforward to show that if x is an indicator variable, we have b1 = ȳ0 and b2 = ȳ1 − ȳ0,
where ȳ0 and ȳ1 are the sample means of y in the x = 0 and x = 1 groups, as the quick check below illustrates.
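A quick numerical check with the wage data used in the example below (a minimal sketch):

library(POE5Rdata)
data("cps5_small")
means <- tapply(cps5_small$wage, cps5_small$female, mean)   # group means of wage by female (0/1)
c(b1 = means[["0"]], b2 = means[["1"]] - means[["0"]])
coef(lm(wage ~ female, data = cps5_small))                   # OLS gives the same numbers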
If you learned about the test for the difference of two means in an elementary statistics
course, you will recognize that the regression here includes that earlier topic as a special case.
Next, we will go over an example in R. In this case we have binary indicator variable for gender, I, such that
I = 1 (gender = female) and I = 0 (gender = male). Our model is a regression of hourly wages (WAGE) on
gender (FEMALE = 1(female)). In this case our model can be expressed as:
E(WAGE) = β0 + β1 FEMALE =
β0 + β1, if FEMALE = 1
β0,      if FEMALE = 0.
The respective model parameters are given by β0 = E[WAGE | Male] and β1 = E[WAGE | Female] − E[WAGE | Male].
# Example 8: Wages (y=wage) vs. Gender (x=female)
library(POE5Rdata)
data("cps5_small")
attach(cps5_small)
# Note: When dealing with indicator variables it is important to check the number of
# observations in each category to see if the data are balanced or not.
S(cps5_small[,1:4])
wage_female = wage[which(female==1)]
wage_male = wage[which(female==0)]
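The density histograms shown next compare the two groups; a sketch of such a chunk is:

par(mfrow = c(2, 1))
hist(wage_female, freq = FALSE, col = "skyblue3",
     main = "Density Histogram of the Wages (Female)", xlab = "Wages")
hist(wage_male, freq = FALSE, col = "skyblue3",
     main = "Density Histogram of the Wages (Male)", xlab = "Wages")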
(Figure: density histograms of wages for females and for males.)
# Now we can estimate the model and interpret the results
reg.mod = lm(wage~female)
S(reg.mod)