
Simple Linear Regression

and Correlation

Regression: Methodology for studying the relationship between two sets of variables.

Simplest case is where we have only two variables, say x and y.

How does y change when x increases or decreases?

x               y
Animal age      Animal weight
Height          Weight
Age             Blood pressure
Temperature     Chemical reaction time

2
Example 1

Galileo discovered that gravity accelerates all objects at the same rate. Drop a bowling ball and a marble together from the top of a building and they will reach the ground at the same time.

x = height,   y = time it takes the object to reach the ground when dropped from height x.

Let g be the acceleration due to gravity. Isaac Newton argued that

        y = √(2x / g).

Under ‘ideal’ conditions, this equation is exact.

3
Suppose we didn’t know about the equation
and wanted to establish the relationship be-
tween x and y by experimentation.

How the experiment might be done:

• Choose several values of x, say n of them, and call them x1, . . . , xn.

• For each height xi, drop a ball three times from height xi and record the times it took the ball to reach the ground.

• Fit a curve through the 3n data points. This curve estimates the functional relationship between x and y. (A simulated version of this experiment is sketched below.)
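The following is a minimal R sketch of how such an experiment might be simulated; the heights, the value g ≈ 9.8 m/s², and the size of the recording error are all illustrative choices, not values from the notes.

    # heights (x) in meters, three drops per height
    g <- 9.8
    x <- rep(c(5, 10, 15, 20), each = 3)
    # recorded times: true time sqrt(2x/g) plus a small recording error
    y <- sqrt(2 * x / g) + rnorm(length(x), mean = 0, sd = 0.05)
    plot(x, y, xlab = "height", ylab = "time")
    # fit the theoretical curve by least squares on sqrt(x) (no intercept)
    fit <- lm(y ~ I(sqrt(x)) - 1)
    curve(coef(fit) * sqrt(x), add = TRUE)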

4
[Scatterplot for gravity experiment: time (vertical axis) versus height (horizontal axis), with a fitted curve through the points.]

The curve fitted through the points estimates the true relationship between height (x) and time (y).

Note: Differences between times at the same height are due completely to (random) errors (or ‘noise’) made in recording the times.

5
Example 2

Sir Francis Galton (1822-1911), cousin of Charles Darwin, was an English scientist who, among many other things, studied inheritance. (He coined the term “nature vs. nurture.”)

Galton was an early pioneer in the field of statistics. He is credited with some of the earliest discoveries in the areas of regression and correlation.

Galton collected data on the heights of fathers and their sons. A famous data set of his on the heights of 1,078 pairs of fathers and sons is shown on the next page.

6
[Scatterplot: Galton’s father-son height data; son’s height (vertical axis) versus father’s height (horizontal axis).]

The red line denotes the average height of all the sons, the blue line denotes the average son’s height as a function of father’s height, and the black line simply denotes the line y = x + 1.
7
Note: The average height of all the sons is
about one inch higher than that of the fathers.

The line y = x + 1 is interesting because intuitively we might expect that a group of fathers all of whom have height x would have sons whose heights average x + 1.

However, this isn’t true. Sons’ heights regress towards the overall mean. This is illustrated by the fact that the slope of the line of averages is less than 1.

Consider the heights of sons whose fathers are all the same height. Obviously the sons’ heights won’t all be the same (so there is still ‘noise’).

In contrast to Example 1, the differences between the heights of the sons are not due solely to errors made in recording the sons’ heights.

8
Initially we study a fairly simple situation in
which the relation between y and x is ≈ linear:

y ≈ ax + b.

Statistical model

x1, . . . , xn are values of a control or independent or explanatory variable (also called covariate or predictor or regressor or feature, etc.).

These x values are fixed by the experimenter. (Note: these may be random variables but are considered “conditioned”, i.e. fixed, here.)

Y1, . . . , Yn are corresponding values of a response or outcome or dependent variable. These are random variables (even though the respective xi is fixed!), as indicated by the upper case Y’s.

We assume a simple linear regression model:

Yi = β0 + β1 xi + εi,   i = 1, . . . , n.
9
The following assumptions are made about the
preceding model:

(1) ε1, . . . , εn are unobserved random variables. Note: they are also called errors or ‘noise’.

(2) ε1, . . . , εn are independent of each other.

(3) Each εi has a N(0, σ²) distribution. The variance σ² stays the same for all εi’s. (Also called the ‘homoskedasticity assumption’.)

(4) β0, β1 and σ² are unknown, but fixed, parameters. β0 is called the intercept, β1 the slope and σ² the error (or ‘noise’) variance.
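As a concrete illustration, here is a minimal R sketch that simulates data satisfying assumptions (1)-(4); the parameter values β0 = 2, β1 = 0.5 and σ = 1 are arbitrary illustrative choices, not values from the notes.

    set.seed(1)
    n   <- 50
    x   <- seq(1, 10, length.out = n)      # fixed design points
    b0  <- 2; b1 <- 0.5; sigma <- 1        # illustrative parameter values
    eps <- rnorm(n, mean = 0, sd = sigma)  # iid N(0, sigma^2) errors
    y   <- b0 + b1 * x + eps               # responses from the model
    plot(x, y)
    abline(b0, b1)                         # the true regression line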

The statistical problem:

Infer the unknown parameters β0, β1 and σ² from the data (x1, y1), (x2, y2), . . . , (xn, yn).
10
Note that, according to the model,

E(Yi) = β0 + β1 xi + E(εi) = β0 + β1 xi,

and Var(Yi) = σ². Here E(·) and Var(·) respectively denote the expectation (i.e. average or mean) and variance of Y given the predictor.

Interpretation: The “expected” (i.e. average) value of Y in the sub-population where the predictor equals x is a linear function of x.

The parameter σ² determines how tightly the data will cluster about the line y = β0 + β1x. If σ² = 0, the relationship is deterministic.

An analysis of the relationship between x and y is referred to as a regression analysis.

Interpretation of slope parameter: β1 represents the change in the average response when the independent variable increases by 1 unit.

Example: if β1 = −5, this means the average response decreases by 5 when x goes up 1 unit.
11
Given a set of data, how do we estimate the
parameters β0 and β1? One method of doing
so is based on the principle of least squares.

Given the data (x1, y1), (x2, y2), . . . , (xn, yn), find the values of b0 and b1 such that

f(b0, b1) = Σ_{i=1}^n [yi − (b0 + b1 xi)]²

is minimized.

Carl Friedrich Gauss (1777-1855) is credited with discovering least squares.

• Take partial derivatives of f wrt b0 and b1.

• Set derivatives to 0 and solve for b0 and b1.

12
Illustration of the Least Squares Principle

[Two scatterplots of ozone versus carbon, showing two candidate lines and the vertical deviations of the data from each line.]

13
∂f(b0, b1)/∂b0 = 2 Σ_{i=1}^n [yi − (b0 + b1 xi)](−1)

∂f(b0, b1)/∂b1 = 2 Σ_{i=1}^n [yi − (b0 + b1 xi)](−xi).

Setting the first expression equal to 0 and solving for b0 yields

b0 = ȳ − b1 x̄.

Substituting this value of b0 into the second equation leads to

b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = β̂1.

This implies that the best b0 is

β̂0 = ȳ − β̂1 x̄.

β̂0 and β̂1 are called the least squares estimates of β0 and β1.
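A minimal R sketch of these formulas; the x and y values below are made-up illustrative data, and lm() is used only as a cross-check.

    x  <- c(1, 2, 3, 4, 5)             # illustrative data
    y  <- c(2.1, 2.9, 4.2, 4.8, 6.1)
    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    b0 <- mean(y) - b1 * mean(x)
    c(b0, b1)
    coef(lm(y ~ x))                    # should agree with b0 and b1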

14
Example 3: Using R to find least squares line
for Galton’s data

Suppose the response and independent variables are called y and x, respectively, in your R session.

The R command:

summary(lm(y ~ x))

gives the basic statistics for a linear regression fit, including the least squares line.

The estimated intercept and slope are:

β̂0 = 33.887 and β̂1 = 0.514.

15
The least squares line is:

y = 33.887 + 0.514x.

For the population of father-son pairs from which these data were collected, it is estimated that when father’s height increases by 1 in., the son’s height increases, on average, by 0.514 in.

16
Estimation of σ²

In our model σ² is the other unknown parameter (besides β0 and β1). Our model says that

Yi = β0 + β1 xi + εi,

where

σ² = Var(εi).

Since

εi = Yi − (β0 + β1 xi),

the definition of variance tells us that

σ² = E{[Yi − (β0 + β1 xi)]²}.

It would thus make sense to estimate σ² by something like

(1/n) Σ_{i=1}^n [yi − (β̂0 + β̂1 xi)]².

Define ŷi = β̂0 + β̂1 xi, i = 1, . . . , n. These are called the predicted values.
17
The so-called residuals are

ei = yi − ŷi,   i = 1, . . . , n.

The residuals serve as proxies for the unobservable error terms. Define the error sum of squares, or SSE, by

SSE = Σ_{i=1}^n ei².

Our estimate of σ² will be:

σ̂² = SSE / (n − 2).

Dividing by n − 2, rather than n, makes σ̂² an unbiased estimator. (Recall the case of the sample variance, where you need to divide by n − 1.)

Here we subtract 2, rather than 1, from n since we have had to estimate two parameters, β0 and β1, in order to construct the variance estimator σ̂².
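A minimal R sketch of this estimate, assuming numeric vectors y and x and a fitted object fit = lm(y ~ x) as in Example 3; summary(fit)$sigma is the square root of σ̂², so the last line is only a sanity check.

    fit <- lm(y ~ x)
    e   <- resid(fit)                   # residuals e_i = y_i - yhat_i
    SSE <- sum(e^2)                     # error sum of squares
    n   <- length(e)
    SSE / (n - 2)                       # unbiased estimate of sigma^2
    summary(fit)$sigma^2                # should match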

18
Use of residuals to check the model

The residuals are a main tool in checking whether or not our model assumptions are correct. If the model is correct, then e1, . . . , en should behave very much like a random sample from a normal distribution.

Plotting the residuals in various ways allows us to check the assumptions.

• Plotting ei vs. xi checks whether the regression curve is more complicated than just a straight line.

• Plotting ei vs. ŷi is a good way to check the “equal variances” assumption.

• A histogram, kernel density estimate or normal quantile plot of the residuals allows us to check the normality assumption.

19
In R, suppose we use the following commands:

fit = lm(y ~ x)
resid = fit$residuals
predict = fit$fitted.values

This puts the residuals into the vector called resid and the predicted values into the vector predict.

We can then produce the various plots on the previous page. The first two plots can be produced via the plot() command, as follows.

plot(x, resid, xlab = "x", ylab = "Residuals")

plot(predict, resid, xlab = "yhat", ylab = "...")

The third set of plots can be produced using: hist(resid) for the histogram, plot(density(resid)) for the density estimator, and qqnorm(resid) for the normal quantile plot or q-q plot which, ideally, should closely resemble a straight line.
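Putting these together, a minimal sketch that draws the four diagnostic plots in one window; par(mfrow = ...) is just one convenient way to arrange them, and qqline() simply adds a reference line to the q-q plot.

    par(mfrow = c(2, 2))                 # 2 x 2 grid of plots
    plot(x, resid, xlab = "x", ylab = "Residuals")
    plot(predict, resid, xlab = "yhat", ylab = "Residuals")
    hist(resid)
    qqnorm(resid); qqline(resid)         # q-q plot with reference line
    par(mfrow = c(1, 1))                 # reset the plotting layout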

The next two slides show a few examples of the first two types of plots.
20
[Scatterplot of y versus x.]

[Plot of residuals from the fitted straight line versus x.]

Note the pattern in the residuals.

21
[A “good” plot of residuals versus ŷ.]

[Plot of residuals versus ŷ in which the variance increases with ŷ.]

22
Coefficient of determination

Define the total sum of squares, or SST, by

SST = Σ_{i=1}^n (yi − ȳ)².

This measures all the variation in the yi’s, from whatever source.

ANOVA decomposition of SST

SST = Σ_{i=1}^n (yi − ŷi + ŷi − ȳ)²
    = SSE + Σ_{i=1}^n (ŷi − ȳ)² + 2 Σ_{i=1}^n (yi − ŷi)(ŷi − ȳ).

Now we will argue that the very last term on the right-hand side above is 0.

23
ŷi = β̂0 + β̂1 xi = ȳ − β̂1 x̄ + β̂1 xi  ⟹

ŷi − ȳ = β̂1 (xi − x̄)

and

yi − ŷi = yi − ȳ − β̂1 (xi − x̄).

Therefore

Σ_{i=1}^n (yi − ŷi)(ŷi − ȳ)
  = β̂1 Σ_{i=1}^n (xi − x̄)(yi − ȳ) − β̂1² Σ_{i=1}^n (xi − x̄)²
  = β̂1² Σ_{i=1}^n (xi − x̄)² − β̂1² Σ_{i=1}^n (xi − x̄)² = 0.

We’ve proven that

SST = SSE + Σ_{i=1}^n (ŷi − ȳ)².

24
Define

SSR = Σ_{i=1}^n (ŷi − ȳ)²,

which is called the regression sum of squares.

SSR: the amount of variation in y that is systematic and can be explained via the linear model, i.e. the linear relationship between y and x.

SSE: the amount of variation in y that is unsystematic, i.e. not explainable through x and purely due to random error or noise unrelated to x.

A nice summary measure is the coefficient of determination, given by the ratio:

R² = SSR / SST = 1 − SSE / SST.

It’s always true that 0 ≤ R² ≤ 1.

R² is the fraction of the total variation in y that is due to (or explainable via) the straight-line relationship between y and x.
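A minimal R sketch of this decomposition, again assuming numeric vectors y and x and a fitted object fit = lm(y ~ x); summary(fit)$r.squared is used only as a cross-check.

    fit  <- lm(y ~ x)
    yhat <- fitted(fit)
    SST  <- sum((y - mean(y))^2)        # total sum of squares
    SSE  <- sum((y - yhat)^2)           # error sum of squares
    SSR  <- sum((yhat - mean(y))^2)     # regression sum of squares
    c(SST = SST, check = SSE + SSR)     # decomposition: SST = SSE + SSR
    SSR / SST                           # coefficient of determination
    summary(fit)$r.squared              # should match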
25
Scatterplots and associated values of R²

[Four scatterplots of y versus x: two with r² = 0.006 and two with r² = 0.175.]

26
[Four more scatterplots of y versus x: two with r² = 0.491 and two with r² = 0.908.]

27
[Scatterplot: Galton’s father-son height data again; son’s height versus father’s height.]

R² for Galton’s data (Example 2) is 0.251. So, only about 25% of the total variation in sons’ heights is explained by the linear relationship.

28
Inference about β1

Usually we’re not satisfied with just having point estimates of parameters. We’d like to be able to say about how far off the point estimate could be. To this end we can construct a confidence interval and/or test a hypothesis.

First of all we consider the distribution of β̂1. Recall that:

β̂1 = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^n (xi − x̄)².

Properties of β̂1:

1. β̂1 is an unbiased estimator of β1, i.e. E(β̂1) = β1.

2. Var(β̂1) = σ² / Σ_{i=1}^n (xi − x̄)².

3. β̂1 is normally distributed.
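These properties can be checked numerically. Below is a minimal R sketch that repeatedly simulates data from the model (with arbitrary illustrative parameters) and compares the empirical mean and variance of β̂1 with the theoretical values.

    set.seed(2)
    n <- 30; x <- seq(1, 10, length.out = n)
    b0 <- 2; b1 <- 0.5; sigma <- 1       # illustrative true parameters
    Sxx <- sum((x - mean(x))^2)
    b1.hat <- replicate(5000, {
      y <- b0 + b1 * x + rnorm(n, sd = sigma)
      sum((x - mean(x)) * (y - mean(y))) / Sxx
    })
    c(mean(b1.hat), b1)                  # empirical mean vs. true slope
    c(var(b1.hat), sigma^2 / Sxx)        # empirical vs. theoretical variance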

29
Proof of unbiasedness

The numerator of β̂1 is

Σ_{i=1}^n (xi − x̄)Yi − Ȳ Σ_{i=1}^n (xi − x̄) = Σ_{i=1}^n (xi − x̄)Yi.

Our model says the last quantity is

Σ_{i=1}^n (xi − x̄)(β0 + β1 xi + εi)
  = β0 Σ_{i=1}^n (xi − x̄) + β1 Σ_{i=1}^n xi(xi − x̄) + Σ_{i=1}^n εi(xi − x̄)
  = β1 Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n εi(xi − x̄).

Therefore,

β̂1 = β1 + Σ_{i=1}^n εi(xi − x̄) / Σ_{i=1}^n (xi − x̄)²,

and so E(β̂1) = β1. Why?

30
Consider a standardized version of β̂1:

T = (β̂1 − β1) / (σ̂ / √(Σ_{i=1}^n (xi − x̄)²)).

The quantity

σ̂_β̂1 = σ̂ / √(Σ_{i=1}^n (xi − x̄)²)

estimates the standard error of β̂1. The probability distribution of T is the t-distribution with n − 2 degrees of freedom.

The last fact leads to confidence intervals and tests for β1.

A (1 − α)100% confidence interval for β1 is

β̂1 ± t_{n−2;α/2} σ̂_β̂1.

Here t_{n−2;α/2} denotes the (1 − α/2)th quantile of the t_{n−2} distribution.
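A minimal R sketch of this interval, assuming a fitted object fit = lm(y ~ x); the hand computation is compared against confint(), which the notes use later.

    fit   <- lm(y ~ x)
    n     <- length(resid(fit))
    b1    <- coef(fit)[2]                       # slope estimate
    se.b1 <- summary(fit)$coefficients[2, 2]    # its estimated standard error
    alpha <- 0.05
    tcrit <- qt(1 - alpha / 2, df = n - 2)      # t_{n-2; alpha/2}
    b1 + c(-1, 1) * tcrit * se.b1               # (1 - alpha)100% CI for beta1
    confint(fit, level = 0.95)[2, ]             # should match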

31
To test the hypothesis

H0 : β1 = β10,

use the test statistic

T = (β̂1 − β10) / σ̂_β̂1.

The alternative could be any of Ha : β1 ≠ β10, Ha : β1 > β10 or Ha : β1 < β10.

The test is done exactly like the t-test for a population mean, except using n − 2 instead of n − 1 degrees of freedom. Rejection regions?

For a two-sided Ha, reject H0 at level α if |T| > t_{n−2;α/2}. For a one-sided Ha : β1 > β10, or Ha : β1 < β10, reject H0 at level α if T > t_{n−2;α}, or if T < −t_{n−2;α}, respectively.

Of particular interest is testing H0 : β1 = 0. Why?

32
Prediction and inference for E(Y )

x = father’s height y = son’s height

Two distinct problems:

• A six foot tall father would like to predict the height of his unborn son Zach.

• Sir Francis Galton would like to estimate the average height of sons whose fathers are six feet tall.

The first problem, called prediction, is subject to more uncertainty than the second.

Suppose we know the average in the second problem. We would probably use this average as our prediction of Zach’s height, but we wouldn’t really expect this prediction to be right on the mark.
33
Inference for the mean of Y at a given x = x0

E(Y) = β0 + β1 x0

We estimate this by

β̂0 + β̂1 x0.

Properties of the estimator:

1. β̂0 + β̂1 x0 is an unbiased estimator of β0 + β1 x0.

2. Var(β̂0 + β̂1 x0) = σ² [ 1/n + (x0 − x̄)² / Σ_{i=1}^n (xi − x̄)² ].

3. β̂0 + β̂1 x0 is normally distributed.

34
A (1 − α)100% confidence interval for β0 + β1 x0 is:

(β̂0 + β̂1 x0) ± t_{n−2;α/2} SE(x0),

where

SE(x0) = σ̂ [ 1/n + (x0 − x̄)² / Σ_{i=1}^n (xi − x̄)² ]^{1/2}.
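A minimal R sketch of this interval at a single point, assuming fit = lm(y ~ x) and an illustrative value x0 = 68; predict() with interval = "confidence" (used later in the notes) gives the same answer.

    fit  <- lm(y ~ x)
    n    <- length(resid(fit))
    x0   <- 68                                   # illustrative value
    s    <- summary(fit)$sigma                   # sigma-hat
    SEx0 <- s * sqrt(1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
    est  <- sum(coef(fit) * c(1, x0))            # beta0-hat + beta1-hat * x0
    est + c(-1, 1) * qt(0.975, n - 2) * SEx0     # 95% CI for the mean at x0
    predict(fit, newdata = data.frame(x = x0), interval = "confidence")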

Prediction intervals

Predict the value of Y for a subject whose x-value is x0. Our model says

Y = β0 + β1 x0 + ε.

Suppose we guess Y to be β̂0 + β̂1 x0. Then the total error we committed in the process is:

β0 + β1 x0 + ε − (β̂0 + β̂1 x0) = [(β0 + β1 x0) − (β̂0 + β̂1 x0)] + ε.

35
[(β0 + β1 x0) − (β̂0 + β̂1 x0)] = error due to estimating β0 and β1

ε = deviation of Y from β0 + β1 x0

Suppose ε is independent of the data from which β̂0 and β̂1 are computed. Then the variance of the prediction error is

Var(β̂0 + β̂1 x0) + Var(ε)
  = σ² [ 1/n + (x0 − x̄)² / Σ_{i=1}^n (xi − x̄)² ] + σ²
  = σ² [ 1 + 1/n + (x0 − x̄)² / Σ_{i=1}^n (xi − x̄)² ].

36
A (1 − α)100% prediction interval for Y given x = x0 is:

(β̂0 + β̂1 x0) ± t_{n−2;α/2} SEpred(x0),

where

SEpred(x0) = σ̂ [ 1 + 1/n + (x0 − x̄)² / Σ_{i=1}^n (xi − x̄)² ]^{1/2}.

What is the correct interpretation of the prediction interval?

Suppose, for example, that α = 0.05. Then we are 95% confident that the future Y value (corresponding to x0) will be in the prediction interval.

Note: The prediction interval for Y at x will be wider than the corresponding confidence interval for the mean of Y at x, for the same value of x. Makes sense (why?).

37
Example 4: Using R to do inference for Galton’s data

The R command summary(lm(y ~ x)) provides β̂1 and its estimated standard error:

β̂1 = 0.514 and σ̂_β̂1 = 0.027.

The T for testing H0 : β1 = 0 is 19.01, and the associated P-value (< 2 · 10⁻¹⁶) is for the two-sided alternative.

So, there is strong evidence that the heights of fathers and their sons are positively related.

A 95% confidence interval for β1 is:

0.514 ± 1.96(0.027) = 0.514 ± 0.05292,

or (0.461, 0.567).

38
So we’re 95% sure that the slope of the regression line is between 0.461 and 0.567. We can also get this interval using the R commands:

fit = lm(y ~ x)
confint(fit, level = 0.95)

Now we do the following:

• Find a 95% confidence interval for the average height of sons whose fathers are all 5 feet 8 inches tall.

• Predict Zach’s height using an interval in which you have 98% confidence. (Zach’s father is six feet tall.)

In the first of these two problems x0 = 68, and in the second x0 = 72.

39
A description of how to solve both these problems in R is given on pg. 441-442 of your textbook. This will be demonstrated in class. Here are the R commands for getting these:

predict(lm(y ~ x), newdata = data.frame(x = 68),
        interval = "confidence", level = 0.95)

The 95% confidence interval for the average is (68.70, 68.99). So, we are 95% confident that the average height of sons whose fathers are 5’ 8” tall is between 68.70 and 68.99 inches.

predict(lm(y ~ x), newdata = data.frame(x = 72),
        interval = "prediction", level = 0.98)

The 98% prediction interval for Zach’s height is (65.22, 76.59). So, we’re 98% confident that Zach will be between 65.22 and 76.59 in. tall.

It’s good to keep in mind that, at best, these inferences apply to a population of father-son pairs living in Galton’s time.

40
Correlation

In some cases, we want to know how strongly two variables are related without really identifying which of the two variables is the response and which is the predictor.

The (population) correlation coefficient, called ρ, is discussed in Section 4.5.2 of your text.

Given two random variables X and Y with some joint distribution and means µX and µY,

ρ ≡ Corr(X, Y) = Cov(X, Y) / (σX σY),

where σX² = Var(X), σY² = Var(Y) and

Cov(X, Y) = E[(X − µX)(Y − µY)].

Given data, we can estimate ρ. Suppose that (X1, Y1), . . . , (Xn, Yn) are independent and identically distributed (i.i.d.) pairs of realizations of the random variables (X, Y).
41
Real situation where this model holds:

Have a population of individuals. Each individual has an x and a y measurement. Take a random sample of individuals.

How can we estimate ρ = Corr(Xi, Yi)?

Estimate Cov(Xi, Yi) by

(1/n) Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ),

σX² by

σ̂X² = (1/n) Σ_{i=1}^n (Xi − X̄)²,

and σY² by

σ̂Y² = (1/n) Σ_{i=1}^n (Yi − Ȳ)².

Now substitute these estimators for the parameters in the definition of ρ to get an estimator ρ̂.
42
This yields

ρ̂ = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / [ Σ_{i=1}^n (Xi − X̄)² Σ_{i=1}^n (Yi − Ȳ)² ]^{1/2} = R.

This is called Pearson’s product moment correlation coefficient, or simply the sample correlation coefficient.

The reason for also calling ρ̂ “R” is that ρ̂² equals R², the coefficient of determination discussed on pg. 25N (i.e., pg. 25 of the notes).

R (or equivalently, ρ̂) provides a point estimate of ρ. The closer R is to ±1, the stronger the relationship between Xi and Yi. If R ≈ 0, then it’s safe to say that ρ is close to 0.

Note also that the slope estimator satisfies β̂1 = R (σ̂Y / σ̂X).
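These identities are easy to verify numerically; a minimal R sketch, again assuming numeric vectors x and y as in the earlier examples:

    r <- cor(x, y)                      # Pearson sample correlation
    r^2                                 # equals the coefficient of determination
    summary(lm(y ~ x))$r.squared        # should match r^2
    r * sd(y) / sd(x)                   # equals the slope estimate
    coef(lm(y ~ x))[2]                  # should match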

Unfortunately, ρ = 0 does not necessarily imply that Xi and Yi are independent. (Review Section 4.5.2 of the textbook.)
43
We may test the null hypothesis:

H0 : ρ = 0

using the test statistic

T = R √(n − 2) / √(1 − R²).

When H0 is true and n is large, T has an approximate standard normal distribution, i.e. N(0, 1). Tests are then done in the usual way.

The R command cor.test(x, y) provides the correlation coefficient between x and y, a 95% confidence interval for ρ and a two-sided test of H0 : ρ = 0 (at level 0.05).
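A minimal sketch comparing the hand-computed statistic with cor.test(), assuming numeric vectors x and y:

    r <- cor(x, y)
    n <- length(x)
    r * sqrt(n - 2) / sqrt(1 - r^2)     # test statistic for H0: rho = 0
    cor.test(x, y)                      # reports the same statistic as its t value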

44
