1 - Simple Linear Regression and Correlation
Example 1
Suppose we didn’t know about the equation and wanted to establish the relationship between x and y by experimentation.
Scatterplot for gravity experiment
[Scatterplot: time (vertical axis, 0–6) versus height (horizontal axis, 5–20).]
Example 2
Galton’s father-son height data
[Scatterplot: son’s height (vertical axis) versus father’s height (horizontal axis), both roughly 60–75.]
Initially we study a fairly simple situation in which the relation between y and x is ≈ linear:
y ≈ ax + b.
Statistical model
Yi = β0 + β1xi + εi,   i = 1, . . . , n.
The following assumptions are made about the preceding model:
1. E(εi) = 0 for each i.
2. Var(εi) = σ², the same for every i.
3. The εi are independent, and for the inference procedures below they are assumed normally distributed.
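To make the model concrete, here is a minimal R sketch that simulates data from it; the values β0 = 2, β1 = 0.5, and σ = 1 are illustrative assumptions, not from the notes:
set.seed(1)
n = 25
x = runif(n, 0, 10)            # predictor values
eps = rnorm(n, mean=0, sd=1)   # errors satisfying the assumptions above
y = 2 + 0.5*x + eps            # Y_i = beta0 + beta1*x_i + eps_i
plot(x, y)                     # should look roughly linear
abline(2, 0.5)                 # the true line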
Given the data (x1, y1), (x2, y2), . . . , (xn, yn), find the values of b0 and b1 such that
f(b0, b1) = Σ_{i=1}^n [yi − (b0 + b1xi)]²
is minimized.
Illustration of the Least Squares Principle
[Two copies of a scatterplot of ozone versus carbon, used to illustrate the least squares principle.]
Setting the partial derivatives of f equal to zero yields
∂f(b0, b1)/∂b0 = 2 Σ_{i=1}^n [yi − (b0 + b1xi)](−1) = 0,
∂f(b0, b1)/∂b1 = 2 Σ_{i=1}^n [yi − (b0 + b1xi)](−xi) = 0.
The first equation gives
b0 = ȳ − b1x̄.
Substituting this value of b0 into the second equation leads to
b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = β̂1,
and then
β̂0 = ȳ − β̂1x̄.
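A minimal R sketch (with made-up data) computing these estimates directly from the formulas above:
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 = sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)  # slope
b0 = mean(y) - b1*mean(x)                                   # intercept
c(b0, b1)   # agrees with coef(lm(y ~ x))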
Example 3: Using R to find the least squares line for Galton’s data
The R command:
summary(lm(y~x))
gives the basic statistics for a linear regression
fit, including the least squares line.
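For instance, if Galton’s father and son heights were stored in vectors father and son (the vector names here are assumptions), the fit could be obtained as:
fit = lm(son ~ father)
summary(fit)   # coefficient estimates, standard errors, t tests, r-squared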
The least squares line is:
y = 33.887 + 0.514x.
Estimation of σ²
Yi = β0 + β1xi + εi,
where
σ² = Var(εi).
Since
εi = Yi − (β0 + β1xi),
and E(εi) = 0, the definition of variance tells us that
σ² = E{[Yi − (β0 + β1xi)]²}.
The residuals are defined by
ei = yi − ŷi,   i = 1, . . . , n.
The residuals serve as proxies for the unobservable error terms. Define the error sum of squares, or SSE, by
SSE = Σ_{i=1}^n ei².
An unbiased estimator of σ² is σ̂² = SSE/(n − 2).
Use of residuals to check the model
In R, suppose we use the following commands:
fit=lm(y~x)
resid=fit$residuals
predict=fit$fitted.values
This puts the residuals into the vector called
resid and the predicted values into the vector
predict.
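Continuing this sketch, SSE and the estimate σ̂ can then be computed from these vectors:
sse = sum(resid^2)              # error sum of squares
n = length(resid)
sigma.hat = sqrt(sse/(n - 2))   # estimate of sigma; agrees with summary(fit)$sigma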
[Figure: a scatterplot of the data and a plot of the residuals computed above.]
A “good” plot of residuals versus ŷ
[Two plots of residuals (vertical axis) versus yhat (horizontal axis), with the residuals scattering evenly about zero.]
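A sketch of how such a plot can be made from the vectors defined above:
plot(predict, resid, xlab="yhat", ylab="residual")   # residuals vs fitted values
abline(h=0)                                          # reference line at zero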
Coefficient of determination
ŷi = β̂0 + β̂1xi = ȳ − β̂1x̄ + β̂1xi   =⇒   ŷi − ȳ = β̂1(xi − x̄).
Therefore, writing yi − ŷi = (yi − ȳ) − β̂1(xi − x̄),
Σ_{i=1}^n (yi − ŷi)(ŷi − ȳ) = β̂1 Σ_{i=1}^n (xi − x̄)(yi − ȳ) − β̂1² Σ_{i=1}^n (xi − x̄)²
= β̂1² Σ_{i=1}^n (xi − x̄)² − β̂1² Σ_{i=1}^n (xi − x̄)² = 0.
Define
SSR = Σ_{i=1}^n (ŷi − ȳ)²,
which is called the regression sum of squares.
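Because the cross term above is zero, writing yi − ȳ = (yi − ŷi) + (ŷi − ȳ) and squaring both sides gives the standard decomposition
Σ_{i=1}^n (yi − ȳ)² = SSE + SSR.
The left side is the total sum of squares, SST, and the coefficient of determination is
r² = SSR/SST = 1 − SSE/SST.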
[Four pairs of scatterplots of y versus x, labelled r² = 0.006, 0.175, 0.491, and 0.908; the points cluster more tightly around a straight line as r² increases.]
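In R, r² for a fitted line can be read off directly; a sketch, assuming data vectors x and y:
fit = lm(y ~ x)
summary(fit)$r.squared   # the coefficient of determination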
[Scatterplot: son’s height versus father’s height (Galton’s data).]
Inference about β1
Properties of β̂1:
1. E(β̂1) = β1.
2. Var(β̂1) = σ²/Σ_{i=1}^n (xi − x̄)².
Proof of unbiasedness
Substituting Yi = β0 + β1xi + εi into the formula for β̂1 and simplifying, we obtain
β̂1 = β1 + Σ_{i=1}^n εi(xi − x̄) / Σ_{i=1}^n (xi − x̄)².
Since E(εi) = 0 for each i, taking expectations gives E(β̂1) = β1.
Consider a standardized version of β̂1:
T = (β̂1 − β1) / (σ̂/√(Σ_{i=1}^n (xi − x̄)²)).
Under the model assumptions, T has a t distribution with n − 2 degrees of freedom. The quantity
σ̂β̂1 = σ̂/√(Σ_{i=1}^n (xi − x̄)²)
is the estimated standard error of β̂1.
To test the hypothesis
H0 : β1 = β10,
use the test statistic
T = (β̂1 − β10)/σ̂β̂1,
which has the t distribution with n − 2 degrees of freedom when H0 is true.
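A minimal R sketch of this test computed from the formulas, for the common case β10 = 0 and assuming data vectors x and y:
fit = lm(y ~ x)
n = length(y)
sigma.hat = summary(fit)$sigma                 # estimate of sigma
se.b1 = sigma.hat/sqrt(sum((x - mean(x))^2))   # estimated standard error of slope
t.stat = (coef(fit)[2] - 0)/se.b1              # test statistic T
p.value = 2*pt(-abs(t.stat), df = n - 2)       # two-sided p-value
These reproduce the slope row of summary(fit)$coefficients.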
Prediction and inference for E(Y)
At a fixed value x0 of the predictor, the mean response is
E(Y) = β0 + β1x0.
We estimate this by
β̂0 + β̂1x0.
Properties:
1. E(β̂0 + β̂1x0) = β0 + β1x0.
2. Var(β̂0 + β̂1x0) = σ²[1/n + (x0 − x̄)²/Σ_{i=1}^n (xi − x̄)²].
A (1 − α)100% confidence interval for β0 + β1x0 is:
β̂0 + β̂1x0 ± t_{α/2, n−2} σ̂ √(1/n + (x0 − x̄)²/Σ_{i=1}^n (xi − x̄)²).
Prediction intervals
A new observation at x = x0 satisfies
Y = β0 + β1x0 + ε.
The prediction error is Y − (β̂0 + β̂1x0), and its variance is
Var[(β0 + β1x0 + ε) − (β̂0 + β̂1x0)] = σ²[1/n + (x0 − x̄)²/Σ_{i=1}^n (xi − x̄)²] + σ²
= σ²[1 + 1/n + (x0 − x̄)²/Σ_{i=1}^n (xi − x̄)²].
A (1 − α)100% prediction interval for Y given x = x0 is:
β̂0 + β̂1x0 ± t_{α/2, n−2} σ̂ √(1 + 1/n + (x0 − x̄)²/Σ_{i=1}^n (xi − x̄)²).
Example 4: Using R to do inference for Galton’s data
So we’re 95% sure that the slope of the regression line is between 0.461 and 0.567. We can also get this interval using the R commands:
fit=lm(y~x)
confint(fit, level=0.95)
A description of how to solve both of these problems (the confidence interval for E(Y) and the prediction interval for Y) in R is given on pp. 441-442 of your textbook. This will be demonstrated in class. Here are the R commands for getting these:
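A sketch, assuming the fitted model fit from above; the new value x0 = 70 is purely illustrative:
x0 = 70
predict(fit, newdata=data.frame(x=x0), interval="confidence", level=0.95)   # CI for E(Y)
predict(fit, newdata=data.frame(x=x0), interval="prediction", level=0.95)   # PI for Y
The column name in newdata must match the predictor name used in the lm formula.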
Correlation
Let R denote the sample correlation coefficient. To test
H0 : ρ = 0,
use the test statistic
T = R√(n − 2)/√(1 − R²),
which has the t distribution with n − 2 degrees of freedom when H0 is true.
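In R this test is carried out by cor.test; a sketch, assuming data vectors x and y:
cor.test(x, y)   # reports r, the T statistic above with n − 2 df, and a p-value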