Chapter 9
Simple Linear Relationship & Correlation
Prepared by: Dr. Elias Dabeet
DISPLAYING THE RELATIONSHIP
DEFINITIONS:
Studies are often conducted to attempt to show that some explanatory
variable “causes” the values of some response variable to occur.
The response or dependent variable is the response of interest, the
variable we want to predict, and is usually denoted by y.
The explanatory or independent variable attempts to explain the
response and is usually denoted by x.
A scatter plot shows the relationship between two quantitative variables
x and y. The values of the x variable are marked on the horizontal axis,
and the values of the y variable are marked on the vertical axis. Each pair
of observations (xi, yi) is represented as a point in the plot.
Two variables are said to be positively associated if, as x increases, the
values of y tend to increase. Two variables are said to be negatively
associated if, as x increases, the values of y tend to decrease.
When a scatter plot does not show a particular direction, neither positive,
nor negative, we say that there is no linear association.
Student Number   Midterm Score (X)   Final Score (Y)
1 39 62
2 44 69
3 32 68
4 40 86
5 45 88.5
6 46 88.5
7 33 76
8 39 66.5
9 32.5 75
10 21 38
11 30 71
12 39 88
13 44 96.5
14 28.5 71.5
15 38 96
16 43 82.5
17 42 85
18 25.5 28
19 47 95
20 36 39
21 31.5 58
22 32 49
23 42 62
24 21 59
25 41 90
[Figure: Scatterplot of Final vs Midterm Scores. Horizontal axis: Midterm (0 to 50); vertical axis: Final (0 to 100). The 10th student, at (21, 38), is labeled.]
Notes of Caution
1. An observed relationship between two variables does
not imply that there is some causal link between the
two variables.
For example, consider the following scatter-plot of IQ score versus shoe size:
[Figure: scatter plot with IQ on the vertical axis and Shoe Size on the horizontal axis.]
As a person ages, their shoe size increases, as does their IQ score. Although there
is a positive association, there is no causal link between the two variables
shoe size and IQ.
Most studies attempt to show that some explanatory variable "causes" the
values of the response to occur. While we can never positively determine
whether or not there is a distinct cause-and-effect relationship, we can assess
whether there appears to be such a relationship.
2. Sometimes a scatter plot, such as the one in the figure
below, shows a curvilinear relationship between the variables.
Methods for curvilinear relationships are
beyond the scope of this course.
Simple Linear Regression
[Figure: Scatterplot of Final vs Midterm Scores with two candidate lines, Line #1 and Line #2. Horizontal axis: Midterm; vertical axis: Final.]
So the question remains: how do we find a “best-fitting” line?
Equation of a Line
y=a+bx where
b = slope - the amount y changes when x is increased by 1 unit.
a = y-intercept - the value of y when x is set equal to zero.
DEFINITION:
The least squares regression line, given by $\hat{y} = a + bx$, is the
line that makes the sum of the squared vertical deviations of the
data points from the line as small as possible. Performing the
regression is often stated as "regress y on x".
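As a rough illustration of what "sum of the squared vertical deviations" means, the Python sketch below compares two candidate lines on the five Test 1 / Test 2 score pairs from the worked example later in this chapter. The function name `sum_squared_deviations` and the second candidate line (y = x) are chosen here for illustration only; they are not part of the original notes.

```python
# Test 1 (x) and Test 2 (y) scores from this chapter's worked example.
xs = [8, 10, 12, 14, 16]
ys = [9, 13, 14, 15, 19]

def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical deviations of the points from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Line from the worked example (y-hat = 0.8 + 1.1x).
sse_line_1 = sum_squared_deviations(0.8, 1.1, xs, ys)
# Hypothetical competitor line y = x, picked only for comparison.
sse_line_2 = sum_squared_deviations(0.0, 1.0, xs, ys)

print(round(sse_line_1, 2), round(sse_line_2, 2))
```

The first line leaves a much smaller sum of squared deviations than the second, which is the sense in which it fits "better".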
The least squares regression line for regressing final exam scores, y,
on midterm exam scores, x, is given by $\hat{y} = 7.5 + 1.75x$.
The estimated slope of b = 1.75 tells us that for a 1-point increase on
the midterm we would expect, on average, an increase of 1.75
points on the final exam.
The estimated y-intercept of a = 7.5 tells us that if someone were to
score 0 points on the midterm, we would predict they would get
7.5 points on the final exam.
Suppose a new student scores 40 points on the midterm. Based
on our model, what would be their predicted final exam score?
Plug the value x = 40 into our estimated equation. The predicted
final exam score is $\hat{y} = 7.5 + 1.75(40) = 77.5$ points.
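The prediction step can be sketched in a few lines of Python. The function name `predict_final` is an assumption made for this illustration; the coefficients 7.5 and 1.75 are the ones quoted in the text.

```python
def predict_final(midterm, a=7.5, b=1.75):
    """Predicted final exam score from the fitted line y-hat = a + b*x."""
    return a + b * midterm

predicted_40 = predict_final(40)
# The slope is the change in the prediction for a 1-point increase in x.
change_per_point = predict_final(41) - predict_final(40)

print(predicted_40, change_per_point)
```

Calling `predict_final(0)` returns the intercept, 7.5, which matches the interpretation of a given above.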
Calculating the Least Squares Regression Line
The Least Squares Regression Line
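The least squares regression line is given by $\hat{y} = a + bx$, where

\[
\text{slope} = b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
= \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}
\]

\[
y\text{-intercept} = a = \bar{y} - b\bar{x}
\]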
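A minimal Python sketch of the slope and intercept formulas above, using only the standard library. The function names and the four data points are hypothetical, chosen so that the points lie exactly on the line y = 2 + 3x and the result is easy to check by eye.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using the computational (sums) formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = y-bar - b * x-bar
    return a, b

def slope_from_deviations(xs, ys):
    """Slope from the deviation form: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator

# Hypothetical points that lie exactly on y = 2 + 3x.
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]

a, b = least_squares_line(xs, ys)
b_alt = slope_from_deviations(xs, ys)
print(a, b, b_alt)
```

Both forms of the slope formula give the same value; the sums form is usually more convenient for hand calculation with a table.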
Example
Test 1 versus Test 2—Obtaining the Regression Line “By Hand”
(a) Look at the relationship graphically with a scatter-plot to
confirm initially that a linear model seems appropriate.
(b) Calculate the estimated regression line by completing the
calculation table shown below.
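\[
b = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}
= \frac{5(884) - (60)(70)}{5(760) - (60)^2} = \frac{220}{200} = 1.1
\]

\[
a = \bar{y} - b\bar{x} = \frac{70}{5} - 1.1\left(\frac{60}{5}\right) = 0.8
\]

Least squares equation: $\hat{y} = 0.8 + 1.1x$.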
Slope of the line is b = 1.1.
This means that Test 2 scores are expected to go up by 1.1 points
on average for each additional point scored on Test 1.
A student who scored 15 points on Test 1 is predicted to score
$\hat{y} = 0.8 + 1.1(15) = 17.3$ points on Test 2.
Example:
Subject   Age (x)   Glucose Level (y)   xy   x²   y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Total 247 486 20485 11409 40022
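\[
b = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}
= \frac{6(20485) - (247)(486)}{6(11409) - (247)^2} = 0.385225
\]

\[
a = \bar{y} - b\bar{x} = \frac{486}{6} - 0.385225\left(\frac{247}{6}\right) = 65.1416
\]

\[
\hat{y} = 65.1416 + 0.385225x
\]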
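The age and glucose calculation can be checked with a short Python sketch that rebuilds the column totals from the raw data and then applies the same formulas. Variable names are chosen here for illustration.

```python
# Age (x) and glucose level (y) for the six subjects in the table.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

n = len(ages)
sum_x = sum(ages)                                     # 247
sum_y = sum(glucose)                                  # 486
sum_xy = sum(x * y for x, y in zip(ages, glucose))    # 20485
sum_x2 = sum(x * x for x in ages)                     # 11409

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)

print(round(b, 6), round(a, 4))
```

The rounded values agree with the slope and intercept obtained by hand.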
CORRELATION: HOW STRONG IS THE LINEAR RELATIONSHIP?
DEFINITION:
The sample correlation coefficient r measures the strength of
the linear relationship between two quantitative variables. It
describes the direction of the linear association and indicates how
close the points in a scatter-plot are to the least squares
regression line.
Features of the correlation coefficient.
1. Range $-1 \le r \le 1$
2. Sign The sign of the correlation coefficient
indicates direction of association: negative $[-1, 0)$ or positive $(0, +1]$.
3. Magnitude The magnitude of the correlation
coefficient indicates the strength of the linear
association. If the data follow a straight line,
$r = 1$ (if the slope is positive) or $r = -1$ (if
the slope is negative), indicating a perfect
linear association. If $r = 0$ then there is no
linear association.
4. Measures Strength The correlation only measures the
strength of the linear association.
5. Unit-less The correlation is computed using standard
scores of the two variables. It has no unit of
measure and the absolute value of r will not
change if the units of measurement for x or y
are changed. The correlation between x and y
is the same as the correlation between y and
x.
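The unit-less and symmetry properties in the list above can be checked numerically. The sketch below uses the computational formula for r that is given later in this chapter; the heights and weights are hypothetical values made up for the illustration, and the function name `correlation` is an assumption.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r from the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical heights (inches) and weights (pounds), for illustration only.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 135, 150, 155]

heights_cm = [h * 2.54 for h in heights_in]                    # change of units for x
r_original = correlation(heights_in, weights_lb)
r_rescaled = correlation(heights_cm, weights_lb)
r_swapped = correlation(weights_lb, heights_in)                # roles of x and y exchanged
r_flipped = correlation([-h for h in heights_in], weights_lb)  # x multiplied by -1

print(math.isclose(r_original, r_rescaled))
print(math.isclose(r_original, r_swapped))
print(math.isclose(r_original, -r_flipped))
```

The last comparison shows why the list speaks of the absolute value of r: multiplying a variable by a negative constant reverses the sign of r but leaves its magnitude unchanged.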
Some Pictures....
[Figure: scatter plot of y versus x. Positive, moderate to strong linear association, $r = 0.8$.]
[Figure: scatter plot of y versus x. Negative, weak linear association, $r = -0.2$.]
[Figure: scatter plot of y versus x. A strong association, just not a linear one, $r = 0$.]
How to Calculate the Correlation Coefficient r
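The formula:

\[
r = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}
{\sqrt{\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n \sum y_i^2 - \left(\sum y_i\right)^2\right]}}
\]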
Example: Test 1 versus Test 2, Obtaining the Correlation Coefficient “By Hand”
We already have computed the summation quantities needed for
finding r, shown in the calculation table.
Completed Calculation Table
x_i   y_i   x_i²   x_i y_i   y_i²
8 9 64 72 81
10 13 100 130 169
12 14 144 168 196
14 15 196 210 225
16 19 256 304 361
Totals: $\sum x_i = 60$, $\sum y_i = 70$, $\sum x_i^2 = 760$, $\sum x_i y_i = 884$, $\sum y_i^2 = 1032$
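\[
r = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{\left[n \sum x_i^2 - (\sum x_i)^2\right]\left[n \sum y_i^2 - (\sum y_i)^2\right]}}
= \frac{5(884) - (60)(70)}{\sqrt{\left[5(760) - (60)^2\right]\left[5(1032) - (70)^2\right]}} = 0.965
\]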
The large positive correlation coefficient and the scatter-plot
indicate a strong, positive, linear association between Test 1 and
Test 2 scores.
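The hand calculation of r can be reproduced by plugging the totals from the calculation table into the formula. This is only a sketch of the arithmetic; the variable names are chosen here for illustration.

```python
import math

# Totals from the completed calculation table (Test 1 = x, Test 2 = y).
n = 5
sum_x, sum_y = 60, 70
sum_x2, sum_xy, sum_y2 = 760, 884, 1032

numerator = n * sum_xy - sum_x * sum_y      # 5(884) - (60)(70)
x_part = n * sum_x2 - sum_x ** 2            # 5(760) - 60^2
y_part = n * sum_y2 - sum_y ** 2            # 5(1032) - 70^2
r = numerator / math.sqrt(x_part * y_part)

print(numerator, x_part, y_part, round(r, 3))
```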
Let’s Do It At-Home Exercise! Birth Rates
We gathered data from 1970 for twelve nations on the percentage
of women aged 14 or older who were economically active and the
crude birth rate. (We define the crude birth rate as the number of
births in a year per 1000 population.) We are interested in the
relationship between the crude birth rate (y) and the percentage of women
who were economically active (x).
Nation x y
Algeria 2 48
Argentina 19 21
Denmark 34 14
E. Germany 40 11
Guatemala 8 41
India 12 37
Ireland 20 22
Jamaica 20 31
Japan 37 19
Philippines 19 42
USA 30 15
Soviet Union 46 18
a. Create the scatter-plot. Determine if there is a positive, negative, or no
association between x and y.
b. Find the equation of the regression line. Interpret the slope.
c. Find the correlation coefficient r.
THE SQUARED CORRELATION r²: WHAT DOES IT TELL US?
r = correlation coefficient, gives the strength and the direction of
the linear relationship between two quantitative variables x and y;
$-1 \le r \le 1$.
Note that when we square r we get $0 \le r^2 \le 1$. The value of $r^2$
is the proportion of the variation in the dependent variable y that is
explained by the independent variable x.
For example, $r^2 = 0.75$ means that about 75% of the variation in the response variable
y can be explained by the linear relationship between x and y.
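One way to see the "variation explained" reading is to compare r² with the share of the total variation in y that the fitted line accounts for. The sketch below does this for the Test 1 / Test 2 data and the line 0.8 + 1.1x from the worked example. The identity used (r² equals 1 minus unexplained variation over total variation, for a least squares line) is a standard result that these notes do not state explicitly, and the variable names are assumptions.

```python
import math

xs = [8, 10, 12, 14, 16]   # Test 1 scores
ys = [9, 13, 14, 15, 19]   # Test 2 scores
n = len(xs)

# Correlation coefficient from the computational formula, then squared.
sxy = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
sxx = n * sum(x * x for x in xs) - sum(xs) ** 2
syy = n * sum(y * y for y in ys) - sum(ys) ** 2
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Share of the variation in y accounted for by the line y-hat = 0.8 + 1.1x.
a, b = 0.8, 1.1
y_bar = sum(ys) / n
total_variation = sum((y - y_bar) ** 2 for y in ys)
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained_share = 1 - unexplained / total_variation

print(round(r_squared, 4), round(explained_share, 4))
```

For these data both numbers come out at about 0.93, so roughly 93% of the variation in Test 2 scores is explained by the linear relationship with Test 1 scores.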
Exercises:
1. A sample of 12 occupational therapy students were subjected to a study by local hospitals
to test whether their knowledge of chemistry depends upon their intelligence test scores.
The twelve students were given a chemistry test and an intelligence test. Determine the
regression equation that can be used for prediction and draw the line.
Intelligence scores:   65 50 55 65 55 70 65 70 55 70 50 55
Chemistry test scores: 85 74 76 90 85 87 94 98 81 91 76 74
2. A study at the college of orthopedics was conducted to determine the
correlation between height and weight of female children. A sample of 6 children was
selected, and the data are in the following table. Determine the correlation coefficient.
Height Weight
12 18
10 17
14 23
11 19
12 20
9 15
3. Dr. Green (a pediatrician) wanted to test if there is a correlation between the number of meals
consumed by a child per day (X) and the child's weight (Y). The table below contains
the information on 4 of the children. Use the table to answer the following:
Child   Number of meals consumed per day (X)   Child weight (Y)   X²   Y²   XY
Ahmad 11 8 121 64 88
Ali 16 11 256 121 176
Osama 12 9 144 81 108
Husien 19 13 361 169 247
Total 58 41 882 435 619
a. Determine the simple linear regression equation.
b. Determine the correlation coefficient. Interpret it in words.
c. Determine the coefficient of determination. Interpret it in words.
d. What is the expected child weight if the number of meals increased by 2 meals per day?
4. A hospital supervisor wishes to find the relationship between the number of nurses on a
job and the number of patients examined for a shift. Listed below is the result for a sample
of 4 days. Let the number of patients ( y ) be the dependent variable.
Nurses (x)   Patients (y)   x²   y²   xy
9 12
3 14
5 11
8 13
a. Compute $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$, $\sum xy$.
b. Determine the coefficient of correlation. Interpret its meaning.
c. Determine the estimated simple linear regression equation.
d. Determine the coefficient of determination. Interpret it in words.
e. If the number of nurses on a job changes by 3, what is the corresponding change in the
number of examined patients?
5. The data below were obtained in a study of age and systolic blood pressure of six randomly
selected subjects. Make a scatter plot to examine the relationship between (x) = age and (y) =
pressure. Comment on the relationship with respect to form, direction, strength, and any
departures or unusual values.
Subject Age x Pressure y
A 43 128
B 48 120
C 56 135
D 61 143
E 67 141
F 70 152