Unit 2
Probability Distributions
Lesson 2
Linear Regression
Lesson 2: Linear Regression
Correlation and Prediction
In the previous lesson, we learned to measure the strength of the
linear relationship between two variables with the correlation
coefficient r.
When there is a strong linear relationship between two variables, we can use
the value of the predictor variable to estimate the value of the
response variable.
A Motivating Example
A person takes in more oxygen when exercising than when at
rest. The oxygen is supplied to the muscles by the heart, which
must beat faster as the exercise level is increased.
Suppose that we wish to determine the oxygen uptake of subjects
at various levels of activity.
Measuring oxygen uptake directly requires the use of
specialized and costly equipment in a lab environment.
Measuring a person's heart rate is simple, inexpensive, and
convenient.
If a person's oxygen uptake can be predicted accurately from the
heart rate, we may be able to use the predicted uptake values
instead of direct measurements for our research purposes.
Oxygen Uptake Data
Suppose the heart rate (HR) and oxygen uptake (VO2) of a subject
exercising on a treadmill were recorded during a 20-minute workout,
giving the following data:

Time    HR     VO2
1       96     0.753
2       95     0.929
3       95     0.939
4       94     0.832
5       95     0.983
6       94     1.049
7       104    1.178
8       104    1.176
9       106    1.292
10      109    1.379
11      108    1.403
12      110    1.499
13      113    1.529
14      113    1.599
15      118    1.749
16      115    1.746
17      121    1.897
18      127    2.040
19      131    2.231
20      130    2.301

The correlation coefficient,
r = 0.986
indicates a strong, positive, linear relationship
between heart rate and oxygen uptake.
Scatter Plot
[Figure: Scatter plot of oxygen uptake vs. heart rate data. Horizontal axis: Heart Rate (90 to 130); vertical axis: Oxygen Uptake (0.0 to 2.5).]
From the scatter plot, observe:
the sample points do not fall on a single line, but they do appear to be scattered about a central line, the line of best fit.
the trend in the data: the oxygen uptake increases as the heart rate increases.
We can use the line of best fit to estimate the oxygen uptake from the
measured heart rate.
For example, at a heart rate of 100 beats per minute, the predicted
oxygen uptake is approximately 1.1 (units?)
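As a rough check on the value read from the plot, the line of best fit can be computed from the recorded data. The sketch below is illustrative only. It assumes the 20 (HR, VO2) pairs from the Oxygen Uptake Data slide and fits the least-squares line defined later in this lesson. The names `hr`, `vo2`, and `fit_line` are hypothetical and not part of the lesson.

```python
from math import sqrt

# Heart rate (beats per minute) and oxygen uptake for minutes 1-20,
# copied from the Oxygen Uptake Data slide.
hr = [96, 95, 95, 94, 95, 94, 104, 104, 106, 109,
      108, 110, 113, 113, 118, 115, 121, 127, 131, 130]
vo2 = [0.753, 0.929, 0.939, 0.832, 0.983, 1.049, 1.178, 1.176, 1.292, 1.379,
       1.403, 1.499, 1.529, 1.599, 1.749, 1.746, 1.897, 2.040, 2.231, 2.301]


def fit_line(x, y):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r = s_xy / sqrt(s_xx * s_yy)
    return slope, intercept, r


slope, intercept, r = fit_line(hr, vo2)
predicted = slope * 100 + intercept  # predicted uptake at 100 beats per minute
print(f"r = {r:.3f}, predicted uptake at HR 100 = {predicted:.2f}")
```

With these data the script should report r close to 0.986 and a predicted uptake close to 1.1, in line with the values quoted on the slides.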
Example 1: Fitting a Line to a Bivariate Data Set
P 216, #3. For the following data set:

x    3    4    5    7    8
y    4    5    7   15   14

(a) Draw a scatter diagram treating x as the predictor variable and y as the response variable.
[Figure: scatter diagram of y against x]
(b) Select two points from the scatter diagram and find an equation of the line (in the form y = mx + b) containing the points.
Take the points (3, 4) and (8, 14).
The slope is: $m = \frac{14 - 4}{8 - 3} = 2$
Substitute m = 2, x = 3, and y = 4 to obtain the y-intercept:
$4 = (2)(3) + b$, so $4 = 6 + b$ and $b = -2$. The line is $y = 2x - 2$.
[Figure: scatter diagram with the line $y = 2x - 2$ drawn through (3, 4) and (8, 14)]
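The two-point calculation can be written as a small helper. This is only a sketch of the arithmetic in part (b); the function name `line_through` is an assumption made for illustration.

```python
def line_through(p, q):
    """Return (m, b) for the line y = m*x + b through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("a vertical line has no slope-intercept form")
    m = (y2 - y1) / (x2 - x1)  # slope: rise over run
    b = y1 - m * x1            # solve y1 = m*x1 + b for b
    return m, b


m, b = line_through((3, 4), (8, 14))
print(m, b)  # slope 2.0 and intercept -2.0, i.e. y = 2x - 2
```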
Errors and Residuals
A predicted value of y is denoted with the symbol $\hat{y}$ (y-hat).
The difference between the actual value of y and the predicted value
of y is the error or residual. That is,
error = observed y value - predicted y value = $y - \hat{y}$
For any predicted value $\hat{y}$, the squared error is $(y - \hat{y})^2$.
For any given line of fit, the sum of the squared errors is
$SSE = \sum (y - \hat{y})^2$
The least squares regression line is the line of fit that minimizes the
sum of the squared errors.
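The definition of SSE translates directly into a short function. The sketch below is a minimal illustration. The x and y values are the ones implied by the sums used in this lesson's examples for P 216, #3 (sum of x = 27, sum of y = 45), and the name `sse` is hypothetical.

```python
def sse(x, y, slope, intercept):
    """Sum of squared errors of the line y_hat = slope*x + intercept."""
    total = 0.0
    for xi, yi in zip(x, y):
        y_hat = slope * xi + intercept   # predicted value
        residual = yi - y_hat            # observed minus predicted
        total += residual ** 2
    return total


# Data set assumed from P 216, #3 as used in this lesson's examples.
x = [3, 4, 5, 7, 8]
y = [4, 5, 7, 15, 14]
print(sse(x, y, slope=2, intercept=-2))  # 11.0 for the line y = 2x - 2
```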
Example 2: Fitting a Line to a Bivariate Data Set
P 216, #3. For the following data set:
(c) Compute the sum of the squared errors for the line $\hat{y} = 2x - 2$.

x                      3    4    5    7    8
y                      4    5    7   15   14
$\hat{y}$              4    6    8   12   14
$y - \hat{y}$          0   -1   -1    3    0
$(y - \hat{y})^2$      0    1    1    9    0

$SSE = \sum (y - \hat{y})^2 = 0 + 1 + 1 + 9 + 0 = 11$
[Figure: scatter diagram with the line $\hat{y} = 2x - 2$ through (3, 4) and (8, 14)]
Equation of the Least Squares Regression Line
The equation of the least-squares regression line is given by
$\hat{y} = b_1 x + b_0$
where the slope of the least-squares line is
$b_1 = \frac{S_{xy}}{S_{xx}}$
and the intercept of the least-squares line is
$b_0 = \bar{y} - b_1 \bar{x}$
Recall
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$
Note: The value $\hat{y}$ predicts the mean value of the response variable y
for a specific value of x.
The graph of the least-squares equation is also known as the
line of means of the data.
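The slope and intercept formulas can be sketched in a few lines of code. This is an illustrative implementation of the computational (sum-based) forms of $S_{xx}$ and $S_{xy}$, not a prescribed method. The function name `least_squares` is an assumption, and the data are the five points assumed from P 216, #3.

```python
def least_squares(x, y):
    """Return (b1, b0) using the S_xy / S_xx computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b1 = s_xy / s_xx                    # slope
    b0 = sum_y / n - b1 * sum_x / n     # intercept: y-bar minus b1 times x-bar
    return b1, b0


b1, b0 = least_squares([3, 4, 5, 7, 8], [4, 5, 7, 15, 14])
print(round(b1, 2), round(b0, 2))  # 2.38 -3.87
```

Because the slope is not rounded before the intercept is computed, the intercept comes out as -3.87. Rounding the slope to 2.38 first gives -3.85.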
Example 3: Finding the Least Squares Regression Line
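P 216, #3. For the following data set:
(d) Find the equation of the least squares regression line.

x        3    4    5    7    8     $\sum x = 27$, $\bar{x} = 5.4$
y        4    5    7   15   14     $\sum y = 45$, $\bar{y} = 9$
$x^2$    9   16   25   49   64     $\sum x^2 = 163$
$xy$    12   20   35  105  112     $\sum xy = 284$

$S_{xx} = 163 - \frac{27^2}{5} = 17.2$
$S_{xy} = 284 - \frac{(45)(27)}{5} = 41$
The slope of the least-squares line is
$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{41}{17.2} = 2.38$
The intercept of the least-squares line is
$b_0 = \bar{y} - b_1 \bar{x} = 9 - (2.38)(5.4) = -3.85$
(carrying the unrounded slope 41/17.2 through the calculation gives -3.87).
Thus, $\hat{y} = 2.38x - 3.85$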
(e) Graph the least squares regression line on the scatter diagram.
[Figure: scatter diagram with the least-squares regression line $\hat{y} = 2.38x - 3.85$]
(f) Compute the sum of the squared errors for the regression line $\hat{y} = 2.38x - 3.85$.

x                      3      4      5      7      8
y                      4      5      7     15     14
$\hat{y}$            3.3    5.7    8.1   12.8   15.2
$y - \hat{y}$        0.7   -0.7   -1.1    2.2   -1.2
$(y - \hat{y})^2$   0.49   0.49   1.21   4.84   1.44

$SSE = \sum (y - \hat{y})^2 = 8.47$
Note that this SSE is lower than the SSE of 11 obtained for the first line,
$\hat{y} = 2x - 2$. The least squares line gives the best fit: no other line has a smaller SSE.
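The comparison between the two lines can be checked numerically. The sketch below is an illustration only. It assumes the five data points from P 216, #3 and uses the unrounded slope 41/17.2, so its SSE differs slightly from the hand calculation, which rounds each predicted value to one decimal place. The name `sse` is hypothetical.

```python
def sse(x, y, slope, intercept):
    """Sum of squared errors of the line y_hat = slope*x + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


x = [3, 4, 5, 7, 8]
y = [4, 5, 7, 15, 14]

b1 = 41 / 17.2            # unrounded least-squares slope, S_xy / S_xx
b0 = 9 - b1 * 5.4         # intercept: y-bar minus b1 times x-bar

sse_two_point = sse(x, y, 2, -2)        # line through (3, 4) and (8, 14)
sse_least_squares = sse(x, y, b1, b0)   # least-squares line

print(sse_two_point)                 # 11
print(round(sse_least_squares, 2))   # 8.27
```

Without intermediate rounding the least-squares SSE is about 8.27 rather than 8.47. Either way it is below 11, and moving the slope or intercept slightly in any direction increases it.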
Example 4: Weight vs. Mileage Rating
P 218, #14. The data represent the weight of various domestic cars and their city mileage rating (in mpg) for the 2001 model year.

Weight (pounds)    Miles Per Gallon
3565               19
3440               20
3970               17
3305               19
3340               20
3200               20
3230               19
2560               28
2520               28
3065               20
3600               18
3300               19
3625               19
3590               19
2605               23
2370               28

(a) Find the least squares regression line treating weight as the predictor variable (x) and mileage as the response variable (y).
Using the TI-83, the equation of the least-squares line is:
$\hat{y} = -0.0073x + 44.3$
(b) Interpret the slope and intercept, if possible.
The slope $b_1 = -0.0073$ means that the mileage is reduced by an average of 0.0073 mpg for each one-pound increase in the weight of the car.
Since a weight of x = 0 lbs is not possible, there is no meaningful interpretation of the intercept.
(c) Predict the mileage of an Oldsmobile Aurora (3625 lbs) and compute the residual error.
The predicted mileage is
$\hat{y} = -0.0073(3625) + 44.3 \approx 18$ mpg
The observed mileage is 19 mpg, so the residual error is $y - \hat{y} = 19 - 18 = +1$ mpg.
Is the mileage of an Aurora above or below average for cars of this weight?
Since the residual is positive, the Aurora is above average for cars of its weight.
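The prediction and residual in part (c) amount to two lines of arithmetic, sketched below. The code uses the rounded coefficients reported in part (a). The names `SLOPE`, `INTERCEPT`, and `predict_mpg` are assumptions made for illustration.

```python
# Rounded coefficients of the least-squares line reported in part (a).
SLOPE = -0.0073
INTERCEPT = 44.3


def predict_mpg(weight_lbs):
    """Predicted city mileage for a car of the given weight."""
    return SLOPE * weight_lbs + INTERCEPT


observed = 19                      # Aurora's city mileage from the data table
predicted = predict_mpg(3625)      # about 17.8 mpg, which rounds to 18
residual = observed - predicted    # positive: the car does better than predicted
print(round(predicted, 1), round(residual, 1))
```

Before rounding, the prediction is about 17.8 mpg and the residual about 1.2 mpg, which matches the slide's whole-number values of 18 and +1.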
(d) Draw the least-squares regression line on the scatter diagram of the data and label the residual.
[Figure: "Weight vs. Mileage" scatter diagram of City Mileage (MPG) against Weight (lbs) with the least-squares regression line drawn and the residual labeled.]
(e) Would it be reasonable to use the least-squares regression line to predict the mileage of a Honda Insight, a hybrid gas and electric car? Why?
No. Since the hybrid uses a different fuel source, we cannot expect its mileage to be predicted by this model.
Limitations of the Regression Model
If the least-squares regression line is used to make predictions based
on values of the predictor variable that are much larger or smaller
than the observed values, we say the researcher is working outside
the scope of the model.
Never use a least-squares regression line to make predictions outside
the scope of the model because we cannot be sure the linear relation
continues to exist.
If the correlation coefficient is near zero, indicating a weak or nonexistent
linear relationship between the variables, use the mean value
of the response variable as the predicted value.
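One way to guard against working outside the scope of the model is to refuse predictions for x values outside the observed range. The sketch below is a deliberately strict interpretation of that rule, offered only as an illustration. The helper `predict_in_scope` is hypothetical, the coefficients are the rounded values from Example 4, and the 6000 lb weight is an invented test value.

```python
def predict_in_scope(x_new, x_observed, slope, intercept):
    """Predict y-hat only when x_new lies within the observed x range."""
    lo, hi = min(x_observed), max(x_observed)
    if not lo <= x_new <= hi:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lo}, {hi}]; "
            "the linear relation may not hold there"
        )
    return slope * x_new + intercept


# Observed car weights (pounds) from Example 4.
weights = [3565, 3440, 3970, 3305, 3340, 3200, 3230, 2560,
           2520, 3065, 3600, 3300, 3625, 3590, 2605, 2370]

print(predict_in_scope(3000, weights, -0.0073, 44.3))  # inside 2370..3970
try:
    predict_in_scope(6000, weights, -0.0073, 44.3)     # hypothetical 6000 lb vehicle
except ValueError as err:
    print(err)
```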
Example 5: Brain Size and Intelligence
P 219, #17. Researchers interested in whether a person's brain size is
related to mental capacity selected a sample of 20 students who had
SAT scores higher than 1350 and administered an IQ test. Brain size
was determined by an MRI scan.

Gender    MRI Count    IQ      Gender    MRI Count    IQ
Female    816932       133     Male      949395       140
Female    951545       137     Male      1001121      140
Female    991305       138     Male      1038437      139
Female    833868       132     Male      965353       133
Female    856472       140     Male      955466       133
Female    852244       132     Male      1079549      141
Female    790619       135     Male      924059       135
Female    866662       130     Male      955003       139
Female    857782       133     Male      935494       141
Female    948066       133     Male      949589       144

(a) Find the least-squares regression line treating MRI count as the predictor variable and IQ as the response variable.
$\hat{y} = 0.000029x + 110$
(b) What do you notice about the value of the slope?
The slope is near zero.
Why does this result seem reasonable based on the correlation coefficient calculated earlier?
[Figure: "Brain Size vs. Intelligence" scatter diagram of IQ against MRI Count]
(c) When there is no relation between the predictor and response variables, we use the mean value $\bar{y}$ to predict. Predict the IQ of an individual whose MRI count is 1,000,000.
$\hat{y} = \bar{y} = 136$
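The mean-based prediction in part (c) is simply the average of the 20 observed IQ scores. The sketch below assumes the IQ values listed in the data table for this example; the names `iq` and `y_bar` are hypothetical.

```python
from statistics import mean

# IQ scores for the 20 students: ten female subjects, then ten male subjects.
iq = [133, 137, 138, 132, 140, 132, 135, 130, 133, 133,
      140, 140, 139, 133, 133, 141, 135, 139, 141, 144]

y_bar = mean(iq)   # used as the prediction for every x, including 1,000,000
print(y_bar)       # 136.4, which rounds to 136
```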