Chapter 11:
SIMPLE LINEAR REGRESSION
AND
CORRELATION
Chapter outline
11.1 Empirical Models.
11.2 Simple Linear Regression.
11.3 Properties of the Least Squares Estimators.
11.4 Hypothesis Tests in Simple Linear Regression.
11.8 Correlation.
Learning Objectives
After careful study of this chapter, you should be able
to do the following:
1. Use simple linear regression for building empirical
models for engineering and scientific data.
2. Understand how the method of least squares is used
to estimate the parameters in a linear
regression model.
3. Analyze residuals to determine whether the
regression model is an adequate fit to the data or
whether any underlying assumptions are violated.
Learning Objectives
4. Test statistical hypotheses and construct
confidence intervals on regression model
parameters.
5. Use the regression model to predict a future
observation and construct an appropriate
prediction interval on the future observation.
6. Apply the correlation model.
7. Use simple transformations to achieve a linear
regression model.
11.1 Empirical Models
• Regression analysis is the process of building
mathematical models or mathematical functions
that describe, predict, or control a variable
in terms of one or more other variables.
• Many problems in engineering and science
involve exploring the relationships between two
or more variables.
• Regression analysis is a statistical technique
that is very useful for these types of problems.
11.1 Empirical Models
Example 1:
Suppose that a car rental company that offers
hybrid vehicles charts its revenue as shown below.
How best could we predict the company’s revenue
for the year 2016 ?
Year, x                      1996   2001   2006   2011   2016
Yearly Revenue, y
(in millions of dollars)      5.2    8.9   11.7   16.8     ?
11.1 Empirical Models
Suppose that we plot these points and try to draw a
line through them that fits. Note that there are
several ways in which this might be done (see the
graphs below). Each would give a different estimate
of the company's total revenue for 2016.
11.1 Empirical Models
Based on the scatter diagram, it is probably reasonable
to assume that the mean of the random variable Y is
related to x by the following straight-line relationship:
E(Y | x) = μ_{Y|x} = β0 + β1x
where the slope and intercept of the line are called
regression coefficients.
The simple linear regression model is given by
Y = β0 + β1x + ε
where ε is the random error term.
11.1 Empirical Models
We think of the regression model as an empirical
model.
Suppose that the mean and variance of ε are 0
and σ², respectively. Then
E(Y | x) = E(β0 + β1x + ε) = β0 + β1x + E(ε) = β0 + β1x
The variance of Y given x is
V(Y | x) = V(β0 + β1x + ε) = V(β0 + β1x) + V(ε) = 0 + σ² = σ²
11.1 Empirical Models
The true regression model is a line of mean values:
μ_{Y|x} = β0 + β1x
where β1 can be interpreted as the change in the mean of Y
for a unit change in x.
• Also, the variability of Y at a particular value of x is
determined by the error variance σ².
• This implies there is a distribution of Y-values at each x
and that the variance of this distribution is the same at each
x.
11.1 Empirical Models
To determine the equation of the line that “best” fits the
data, we note that for each data point there will be a
deviation, or error, between the y-value at that point
and the y-value of the point on the line that is directly above
or below the point.
Those deviations, in the example, y1 − 5.2, y2 − 8.9,
y3 − 11.7, and y4 − 16.8, will be positive or negative,
depending on the location of the line.
11.1 Empirical Models
11.2 Simple Linear Regression
We wish to fit these data points with a line,
y = β0 + β1x
that uses values of β0 and β1 that, somehow, minimize
the deviations in order to have a good fit.
One way of minimizing the deviations is based on the
least-squares assumption.
11.2 Simple Linear Regression
Note that squaring each y-deviation gives us a sum of
nonnegative terms. Were we to simply add the deviations,
positive and negative deviations would cancel each
other out.
Using the least-squares assumption with the yearly
revenue data, we want to minimize
(y1 − 5.2)² + (y2 − 8.9)² + (y3 − 11.7)² + (y4 − 16.8)²
11.2 Simple Linear Regression
Also, since the points (1, y1), (2, y2), (3, y3), and (4, y4)
must be solutions of y = β1x + β0, it follows that
y1 = β1(1) + β0 = β1 + β0
y2 = β1(2) + β0 = 2β1 + β0
y3 = β1(3) + β0 = 3β1 + β0
y4 = β1(4) + β0 = 4β1 + β0
Substituting these values for each y in the previous
equation, we now have a function of two variables:
L(β1, β0) = (β1 + β0 − 5.2)² + (2β1 + β0 − 8.9)²
          + (3β1 + β0 − 11.7)² + (4β1 + β0 − 16.8)²
11.2 Simple Linear Regression
Thus, to find the regression line for the given set of
data, we must find the values of β0 and β1 that minimize
the function L given by the sum above.
We first find ∂L/∂β0 and ∂L/∂β1:
∂L/∂β0 = 2(β1 + β0 − 5.2) + 2(2β1 + β0 − 8.9) + 2(3β1 + β0 − 11.7)
       + 2(4β1 + β0 − 16.8) = 20β1 + 8β0 − 85.2
∂L/∂β1 = 2(β1 + β0 − 5.2) + 2(2β1 + β0 − 8.9)(2)
       + 2(3β1 + β0 − 11.7)(3) + 2(4β1 + β0 − 16.8)(4) = 60β1 + 20β0 − 250.6
11.2 Simple Linear Regression
We set the derivatives equal to 0 and solve the resulting
system:
20β1 + 8β0 − 85.2 = 0
60β1 + 20β0 − 250.6 = 0
It can be shown that the solution to this system is
β0 = 1.25, β1 = 3.76
We leave it to the student to complete the D-test
(second-derivative test) to verify that (1.25, 3.76) does,
in fact, yield a minimum of L.
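As a quick numerical check, the 2×2 system above can be solved directly. This is a minimal Python sketch using Cramer's rule (no libraries assumed):

```python
# Solve the system obtained by setting the partial derivatives to zero:
#   20*b1 +  8*b0 =  85.2
#   60*b1 + 20*b0 = 250.6
# Cramer's rule for a 2x2 system: a11*b1 + a12*b0 = c1, a21*b1 + a22*b0 = c2.
a11, a12, c1 = 20.0, 8.0, 85.2
a21, a22, c2 = 60.0, 20.0, 250.6

det = a11 * a22 - a12 * a21          # determinant of the coefficient matrix
b1 = (c1 * a22 - a12 * c2) / det     # slope
b0 = (a11 * c2 - c1 * a21) / det     # intercept

print(round(b1, 2), round(b0, 2))    # 3.76 1.25
```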
11.2 Simple Linear Regression
There is no need to compute L(1.25, 3.76). The values
of β1 and β0 are all we need to determine
y = β1 x + β0. The regression line is
y = 3.76x + 1.25.
The graph of this “best-fit”
regression line together
with the data points is
shown below.
Compare it to the
graphs before.
11.2 Simple Linear Regression
Now we can use the regression equation to predict the car
rental company's yearly revenue in 2016 (x = 5, since the
years 1996, 2001, …, 2016 are coded as x = 1, …, 5):
y = 3.76(5) + 1.25 = 20.05, or about $20.05 million.
The case of simple linear regression considers a single
regressor or predictor x and a dependent or response
variable Y.
The expected value of Y at each level of x is
E(Y | x) = β0 + β1x
We assume that each observation Y can be described by
the model
Y = β0 + β1x + ε
11.2 Simple Linear Regression
• Suppose that we have n pairs of observations (x1, y1),
(x2, y2), …, (xn, yn).
Figure: Deviations of the data from the estimated
regression model.
11.2 Simple Linear Regression
• The method of least squares is used to estimate the
parameters β0 and β1 by minimizing the sum of the
squares of the vertical deviations shown in the figure.
Figure: Deviations of the data from the estimated
regression model.
11.2 Simple Linear Regression
• The n observations in the sample can be expressed as
yi = β0 + β1xi + εi,  i = 1, …, n
• The sum of the squares of the deviations of the
observations from the true regression line is
L = Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} (yi − β0 − β1xi)²
• The least squares estimators of β0 and β1, say β̂0 and β̂1,
must satisfy
∂L/∂β0 |_{β̂0, β̂1} = −2 Σ_{i=1}^{n} (yi − β̂0 − β̂1xi) = 0
∂L/∂β1 |_{β̂0, β̂1} = −2 Σ_{i=1}^{n} (yi − β̂0 − β̂1xi) xi = 0
11.2 Simple Linear Regression
Simplifying these two equations yields
n β̂0 + β̂1 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi
β̂0 Σ_{i=1}^{n} xi + β̂1 Σ_{i=1}^{n} xi² = Σ_{i=1}^{n} xi yi        (*)
Equations (*) are called the least squares normal equations.
The solution to the normal equations results in the least
squares estimators β̂0 and β̂1.
11.2 Simple Linear Regression
Definition: The least squares estimates of the intercept
and slope in the simple linear regression model are
β̂1 = [Σ_{i=1}^{n} xi yi − (Σ_{i=1}^{n} xi)(Σ_{i=1}^{n} yi)/n] / [Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)²/n]
   = Sxy / Sxx        (***)
β̂0 = ȳ − β̂1 x̄,  where ȳ = (1/n) Σ_{i=1}^{n} yi and x̄ = (1/n) Σ_{i=1}^{n} xi
The fitted or estimated regression line is therefore
ŷ = β̂0 + β̂1x
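Formula (***) translates directly into code. This is a minimal sketch (the helper name `least_squares_fit` is my own), checked against the yearly revenue example:

```python
def least_squares_fit(x, y):
    """Return (b0_hat, b1_hat) for the simple linear regression y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar  # S_xy
    sxx = sum(xi * xi for xi in x) - n * xbar * xbar              # S_xx
    b1 = sxy / sxx                # slope estimate, formula (***)
    b0 = ybar - b1 * xbar         # intercept estimate
    return b0, b1

# Yearly revenue example: x = 1..4 codes the years 1996..2011.
b0, b1 = least_squares_fit([1, 2, 3, 4], [5.2, 8.9, 11.7, 16.8])
print(round(b0, 2), round(b1, 2))  # 1.25 3.76
```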
11.2 Simple Linear Regression
Note that each pair of observations satisfies the
relationship
yi = β̂0 + β̂1xi + ei,  i = 1, …, n
where ei = yi − ŷi is called the residual. The residual
describes the error in the fit of the model to the ith
observation yi. Later in this chapter we will use the
residuals to provide information about the adequacy of the
fitted model.
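For the revenue example, the residuals ei = yi − ŷi can be computed directly; note that for a least-squares fit with an intercept the residuals sum to zero (up to rounding). A short sketch:

```python
x = [1, 2, 3, 4]
y = [5.2, 8.9, 11.7, 16.8]
b0, b1 = 1.25, 3.76                        # estimates found earlier

fitted = [b0 + b1 * xi for xi in x]        # y-hat values
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print([round(e, 2) for e in residuals])    # [0.19, 0.13, -0.83, 0.51]
print(round(abs(sum(residuals)), 10))      # 0.0 (residuals sum to zero)
```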
11.2 Simple Linear Regression
Example 2: Fit a simple linear regression model to the
oxygen purity data in the table, relating oxygen purity y
to hydrocarbon level x.
11.2 Simple Linear Regression
Solution:
The following quantities may be computed:
n = 20;  Σ_{i=1}^{20} xi = 23.92;  Σ_{i=1}^{20} yi = 1,843.21
x̄ = 1.196;  ȳ = 92.1605
Σ_{i=1}^{20} yi² = 170,044.5321;  Σ_{i=1}^{20} xi² = 29.2892
Σ_{i=1}^{20} xi yi = 2,214.6566
11.2 Simple Linear Regression
Solution:
Sxx = Σ_{i=1}^{20} xi² − (Σ_{i=1}^{20} xi)²/20 = 29.2892 − (23.92)²/20 = 0.68088
Sxy = Σ_{i=1}^{20} xi yi − (Σ_{i=1}^{20} xi)(Σ_{i=1}^{20} yi)/20
    = 2,214.6566 − (23.92)(1,843.21)/20 = 10.17744
11.2 Simple Linear Regression
Solution:
Therefore, the least squares estimates of the slope and
intercept are
β̂1 = Sxy / Sxx = 14.94748;  β̂0 = ȳ − β̂1 x̄ = 74.28331
The fitted simple linear regression model (with the
coefficients reported to three decimal places) is
ŷ = 74.283 + 14.947x
This model is plotted in Fig.11.2, along with the sample data.
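The arithmetic in Example 2 can be reproduced from the summary sums alone (a sketch; the raw 20 observations live in the table and are not repeated here):

```python
# Summary quantities from Example 2 (oxygen purity data, n = 20).
n = 20
sum_x, sum_y = 23.92, 1843.21
sum_xy = 2214.6566
sum_x2 = 29.2892

sxx = sum_x2 - sum_x**2 / n           # S_xx = 0.68088
sxy = sum_xy - sum_x * sum_y / n      # S_xy = 10.17744
b1 = sxy / sxx                        # slope, about 14.947
b0 = sum_y / n - b1 * (sum_x / n)     # intercept, about 74.283
print(round(b1, 3), round(b0, 3))     # 14.947 74.283
```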
11.2 Simple Linear Regression
Figure 11.2: Scatter plot of oxygen purity y versus
hydrocarbon level x and regression model
ŷ = 74.283 + 14.947x
11.2 Simple Linear Regression
Practical Interpretation: Using the regression model, we
would predict oxygen purity ŷ = 89.23% when the
hydrocarbon level is x = 1%. The 89.23% purity may be
interpreted as an estimate of the true population mean
purity when x = 1%, or as an estimate of a new observation
when x = 1%. These estimates are, of course, subject to
error; that is, it is unlikely that a future observation on purity
would be exactly 89.23% when the hydrocarbon level is
1%. In subsequent sections, we will see how to use
confidence intervals and prediction intervals to describe the
error in estimation from a regression model.
11.2 Simple Linear Regression
Example 3: To study the relationship between ticket prices
and the number of passengers on each flight, researchers
examined 11 commercial flights and obtained the following
data table. Find the regression line for the number of
passengers in terms of ticket price.

Number of passengers   Cost (1000$)
61                     4.28
63                     4.08
69                     4.17
70                     4.48
74                     4.30
76                     4.82
81                     4.70
86                     5.11
91                     5.13
95                     5.64
97                     5.56

Answer: ŷ = −24.53 + 21.67x
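Applying the same least-squares formulas to the flight data (with x = ticket price in $1000 and y = number of passengers) reproduces the stated line; a quick sketch:

```python
cost = [4.28, 4.08, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]
passengers = [61, 63, 69, 70, 74, 76, 81, 86, 91, 95, 97]

n = len(cost)
xbar = sum(cost) / n
ybar = sum(passengers) / n
sxy = sum(x * y for x, y in zip(cost, passengers)) - n * xbar * ybar
sxx = sum(x * x for x in cost) - n * xbar * xbar

b1 = sxy / sxx            # about 21.67 passengers per $1000 of ticket price
b0 = ybar - b1 * xbar     # about -24.53
print(round(b0, 2), round(b1, 2))  # -24.53 21.67
```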
11.2 Simple Linear Regression
Estimating σ²
The error sum of squares is
SSE = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − ŷi)²
An unbiased estimator of σ² is
σ̂² = SSE / (n − 2)        (**)
where SSE can be easily computed using
SSE = SST − β̂1 Sxy;  SST = Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} yi² − n ȳ²
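For the oxygen purity data, formula (**) yields the σ̂² value used later in the hypothesis tests. A sketch using the Example 2 sums:

```python
# Oxygen purity data summaries from Example 2.
n = 20
sum_y, sum_y2 = 1843.21, 170044.5321
b1_hat, sxy = 14.94748, 10.17744

ybar = sum_y / n
sst = sum_y2 - n * ybar**2      # total sum of squares, about 173.38
sse = sst - b1_hat * sxy        # error sum of squares, about 21.25
sigma2_hat = sse / (n - 2)      # unbiased estimate of sigma^2, about 1.18
print(round(sst, 2), round(sse, 2), round(sigma2_hat, 2))  # 173.38 21.25 1.18
```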
11.3 Properties of the Least Squares
Estimators
• Slope Properties: E(β̂1) = β1;  V(β̂1) = σ²/Sxx
• Intercept Properties: E(β̂0) = β0;  V(β̂0) = σ² [1/n + x̄²/Sxx]
In simple linear regression the estimated standard errors of
the slope and intercept are
se(β̂1) = √(σ̂²/Sxx)  and  se(β̂0) = √(σ̂² [1/n + x̄²/Sxx])
respectively, where σ̂² is computed from (**).
11.4 Hypothesis Tests in Simple Linear
Regression
11.4.1. Use of t-Tests:
Suppose we wish to test
H0: β1 = β1,0
H1: β1 ≠ β1,0
An appropriate test statistic would be
t0 = (β̂1 − β1,0) / √(σ̂²/Sxx) = (β̂1 − β1,0) / se(β̂1)
Reject the null hypothesis if |t0| > t_{α/2, n−2}.
Similarly, to test
H0: β0 = β0,0
H1: β0 ≠ β0,0
use the test statistic
t0 = (β̂0 − β0,0) / √(σ̂² [1/n + x̄²/Sxx]) = (β̂0 − β0,0) / se(β̂0)
and reject the null hypothesis if |t0| > t_{α/2, n−2}.
11.4 Hypothesis Tests in Simple Linear
Regression
An important special case of the hypotheses of
Equation (***) is
H0: β1 = 0
H1: β1 ≠ 0
These hypotheses relate to the significance of
regression.
Failure to reject H0 is equivalent to concluding
that there is no linear relationship between x and Y.
11.4 Hypothesis Tests in Simple Linear
Regression
Figure 1: The hypothesis H0: β1 = 0 is not rejected.
Figure 2: The hypothesis H0: β1 = 0 is rejected.
11.4 Hypothesis Tests in Simple Linear
Regression
Example 4: Test for significance of regression using the
model for the oxygen purity data from the table:
a/ β1 at α = 0.01
b/ β0 at α = 0.01
11.4 Hypothesis Tests in Simple Linear
Regression
Solution:
a/ The hypotheses are
H0: β1 = 0
H1: β1 ≠ 0
We have β̂1 = 14.947; n = 20; Sxx = 0.68088; σ̂² = 1.18.
Test statistic:
t0 = β̂1 / √(σ̂²/Sxx) = 14.947 / √(1.18/0.68088) = 11.35
Because |t0| = 11.35 > t_{0.005,18} = 2.88, we reject H0.
11.4 Hypothesis Tests in Simple Linear
Regression
Solution:
b/ The hypotheses are
H0: β0 = 0
H1: β0 ≠ 0
Test statistic: t0 = β̂0 / se(β̂0) = 46.62
Because |t0| = 46.62 > t_{0.005,18} = 2.88, we reject H0.
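Both test statistics in Example 4 can be reproduced from the summary quantities; a sketch (small differences in the last digit come from the rounded inputs):

```python
import math

# Oxygen purity fit: summary quantities from Examples 2 and 4.
n = 20
b1_hat, b0_hat = 14.947, 74.283
sxx, xbar = 0.68088, 1.196
sigma2_hat = 1.18

se_b1 = math.sqrt(sigma2_hat / sxx)                      # se of the slope
se_b0 = math.sqrt(sigma2_hat * (1 / n + xbar**2 / sxx))  # se of the intercept

t0_slope = b1_hat / se_b1      # about 11.35 > t_{0.005,18} = 2.88 -> reject H0
t0_intercept = b0_hat / se_b0  # about 46.6  > 2.88                -> reject H0
print(round(t0_slope, 2))      # 11.35
```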
11.4 Hypothesis Tests in Simple Linear
Regression
11.4.2. Analysis of variance approach to test significance of regression.
Suppose we wish to test
H0: β1 = 0
H1: β1 ≠ 0
Test for Significance of Regression:
F0 = (SSR/1) / (SSE/(n − 2)) = MSR / MSE
where
SSR = Σ_{i=1}^{n} (ŷi − ȳ)²;  SSE = Σ_{i=1}^{n} (yi − ŷi)²
SST = Σ_{i=1}^{n} (yi − ȳ)² = SSR + SSE
Reject H0 if F0 > F_{α,1,n−2}, where F_{α,1,n−2} is the upper α
percentage point of the F distribution with 1 and n − 2
degrees of freedom (see Appendix VI).
11.4 Hypothesis Tests in Simple Linear
Regression
ANOVA table

Source of     Sum of         Degrees of   Mean               F0
Variation     Squares        Freedom      Square
Regression    SSR = β̂1 Sxy   1            MSR = SSR/1        MSR/MSE
Error         SSE            n − 2        MSE = SSE/(n − 2)
Total         SST            n − 1
11.4 Hypothesis Tests in Simple Linear
Regression
Example 5: We will use the analysis of variance approach to test for
significance of regression using the oxygen purity data model from Example 2.
Recall that
SST = 173.38;  β̂1 = 14.947;  Sxy = 10.17744;  n = 20
The regression sum of squares is
SSR = β̂1 Sxy = (14.947)(10.17744) = 152.13
and the error sum of squares is
SSE = SST − SSR = 21.25
The test statistic is F0 = MSR/MSE = 128.86, for which we find
that the P-value ≈ 1.23 × 10⁻⁹, so we conclude that β1 is not zero.
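The ANOVA quantities in Example 5 can be recomputed from the table entries; a sketch (F0 comes out near 128.8 from these rounded inputs, and note that F0 equals the square of the slope t-statistic, 11.35² ≈ 128.8):

```python
# Oxygen purity data: quantities from Example 5.
n = 20
sst = 173.38
b1_hat, sxy = 14.947, 10.17744

ss_r = b1_hat * sxy      # regression sum of squares, about 152.12
ss_e = sst - ss_r        # error sum of squares, about 21.26
ms_r = ss_r / 1          # mean square for regression (1 df)
ms_e = ss_e / (n - 2)    # mean square error (n - 2 df)
f0 = ms_r / ms_e         # about 128.8 -> reject H0: beta1 = 0
print(round(f0, 1))
```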
11.4 Hypothesis Tests in Simple Linear
Regression
Note that:
- The analysis of variance procedure for testing for
significance of regression is equivalent to the t-test.
That is, either procedure will lead to the same
conclusions.
- The t-test is somewhat more flexible in that it would
allow testing against a one-sided alternative
hypothesis, while the F-test is restricted to a two-
sided alternative.
11.8 Correlation
We assume that the joint distribution of Xi and Yi is the
bivariate normal distribution presented in Chapter 5, where
μY and σY² are the mean and variance of Y, μX and σX² are
the mean and variance of X, and ρ is the correlation
coefficient between Y and X. Recall that the correlation
coefficient is defined as
ρ = σXY / (σX σY)
where σXY is the covariance between Y and X.
The conditional distribution of Y for a given value of X = x is
f_{Y|x}(y) = [1 / (√(2π) σ_{Y|x})] exp{ −(1/2) [(y − β0 − β1x) / σ_{Y|x}]² }
where β0 = μY − μX ρ (σY/σX);  β1 = ρ (σY/σX)
11.8 Correlation
It is possible to draw inferences about the correlation
coefficient ρ in this model. The estimator of ρ is the sample
correlation coefficient
R = Σ_{i=1}^{n} (Yi − Ȳ)(Xi − X̄) / √[ Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² ]
  = S_XY / √(S_XX · SST)
Note that
β̂1 = (SST / S_XX)^{1/2} R
We may also write:
R² = β̂1² S_XX / S_YY = β̂1 S_XY / SST = SSR / SST
11.8 Correlation
Properties:
−1 ≤ R ≤ 1
R > 0: positive correlation
R < 0: negative correlation
R = 0: no correlation
11.8 Correlation
Case 1:
It is often useful to test the hypotheses
H0: ρ = 0 (there is no relationship)
H1: ρ ≠ 0 (there is a relationship)
Test statistic:
t0 = R √(n − 2) / √(1 − R²)
Reject H0 if |t0| > t_{α/2, n−2}.
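A sketch of the Case 1 test applied to the flight data from Example 3, with r and t0 computed from scratch (the critical value t_{0.025,9} = 2.262 is taken from the t table):

```python
import math

x = [4.28, 4.08, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]
y = [61, 63, 69, 70, 74, 76, 81, 86, 91, 95, 97]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
sxx = sum(a * a for a in x) - n * xbar**2
syy = sum(b * b for b in y) - n * ybar**2

r = sxy / math.sqrt(sxx * syy)                   # sample correlation, about 0.95
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)  # about 9.08
print(round(r, 2))  # 0.95
# |t0| > t_{0.025,9} = 2.262, so reject H0: rho = 0 at alpha = 0.05.
```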
11.8 Correlation
Case 2:
It is often useful to test the hypotheses
H0: ρ = ρ0
H1: ρ ≠ ρ0
Test statistic:
z0 = (arctanh R − arctanh ρ0) √(n − 3)
where
tanh u = (e^u − e^{−u}) / (e^u + e^{−u})
Reject H0 if |z0| > z_{α/2}.
11.8 Correlation
The approximate 100(1 − α)% confidence interval for ρ is
tanh(arctanh r − z_{α/2}/√(n − 3)) ≤ ρ ≤ tanh(arctanh r + z_{α/2}/√(n − 3))
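This interval codes directly with `math.atanh` and `math.tanh`. A sketch for a sample correlation of r = 0.95 with n = 11 (roughly the flight data) at 95% confidence, using z_{0.025} = 1.96:

```python
import math

r = 0.95       # sample correlation (flight data, n = 11)
n = 11
z_half = 1.96  # z_{0.025} for a 95% interval

half_width = z_half / math.sqrt(n - 3)        # margin on the arctanh scale
lo = math.tanh(math.atanh(r) - half_width)    # lower confidence limit
hi = math.tanh(math.atanh(r) + half_width)    # upper confidence limit
print(round(lo, 3), round(hi, 3))
```

Note that the interval is computed on the arctanh (Fisher z) scale, where R is approximately normal, and then transformed back, so it is not symmetric about r.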
Example: Use the given data to find the equation of the
regression line and the value of the linear correlation
coefficient r, and test the hypothesis that ρ = 0 using α = 0.05
and α = 0.1.
a/
x   2   4   5   6
y   7  11  13  20

b/
Cost     9   2   3   4   2   5   9  10
Number  85  52  55  68  67  86  83  73