Chapter 9
Correlation and Regression
Simple Linear Regression
24 January 2024
24 January 2024 Statistics - Spring Semester 2023-2024
Regression
• Using regression analysis, we can derive an equation by which the
dependent variable (Y) is expressed (and estimated) in terms of
its relationship with the independent variable (X).
• In simple regression, there is only one independent variable (X) and
one dependent variable (Y). The dependent variable is the
outcome we are trying to predict.
• In multiple regression, there are several independent variables (X1,
X2, … ), and still only one dependent variable, Y. We are trying to use
the X variables to predict the Y variable.
• We will be studying Simple Linear Regression, in which we
investigate whether X and Y have a linear relationship.
A Simple Regression Example
• A researcher wishes to determine the relationship between hours studied (X)
and the grade on a quiz (Y). It is important to know which is the dependent
variable, the one you are trying to predict. Presumably, hours studied affects
the grade on a quiz, not the other way around. The data:
X (hours) Y (Grade on quiz)
1 40
2 50
3 60
4 70
5 80
• If X and Y have a linear relationship, and you would like to plot the line,
what would it look like?
• Take a moment and try it. Note that the data points fall on a straight
line.
• Now, add another point: If X=6, then Y= ?
• We see that as X changes by 1, Y changes by 10. That is, the slope, called
b1, is equal to 10.
• b1 = ∆Y / ∆X = 10.
• The Y-intercept, or the value of Y when X=0, called b0, is equal to 30.
• When X = 0, then Y = 30. Someone who studies 0 hours for the quiz
should expect a grade of 30.
• Then, every additional hour studied adds 10 points to the score on the
quiz.
• The equation for the plot of this data is: Ŷ = 30 + 10X
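A minimal sketch of the calculation above in Python, assuming the five data points from the table. The variable and function names are illustrative and not part of the original slides. It recovers the slope from the first two points, the intercept from one point, and answers the X = 6 question.

```python
# Data from the table: hours studied (X) and quiz grade (Y).
hours = [1, 2, 3, 4, 5]
grades = [40, 50, 60, 70, 80]

# Slope b1 = change in Y / change in X, taken between the first two points.
b1 = (grades[1] - grades[0]) / (hours[1] - hours[0])

# Intercept b0: solve Y = b0 + b1*X for b0 using the first point.
b0 = grades[0] - b1 * hours[0]


def predict(x):
    """Predicted grade for x hours of study on the line Y-hat = b0 + b1*x."""
    return b0 + b1 * x


# Every observed point lies exactly on the line in this example.
assert all(predict(x) == y for x, y in zip(hours, grades))

print(b1, b0, predict(6))
```

On this line, a student who studies 6 hours is predicted to score 90.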
• If you plot the above X and Y, you will find that all the points fall
on a straight line. This, of course, means that we have a perfect
linear relationship between X and Y.
• From studying correlation you know that for this data r = +1 and
R² = 100%.
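As a quick check of the claim that r = +1, the sketch below applies the usual computational formula for the correlation coefficient to the five points. The variable names are illustrative assumptions, not taken from the slides.

```python
import math

hours = [1, 2, 3, 4, 5]
grades = [40, 50, 60, 70, 80]
n = len(hours)

# The five sums needed by the computational formula for r.
sum_x = sum(hours)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

r = numerator / denominator
r_squared = r ** 2
print(r, r_squared)
```

Both values come out as 1 (up to floating-point rounding), which matches a perfect positive linear relationship.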
[Figure: scatter plot of the data, with X from 0 to 6 on the horizontal axis and Y from 0 to 90 on the vertical axis]
Simple Linear Regression
• In general, the simple linear regression equation is:
Ŷi = b0 + b1Xi
• Why do we need regression in addition to correlation?
• To predict a Y for a new value of X
• To answer questions regarding the slope. For example,
• With additional shelf space (X), what effect will there be on sales
(Y)?
• If we raise prices by a particular amount or percentage, will it
cause sales to drop?
• It makes the scatter plot a better display (graph) of the data if we
can plot a line through it. It presents much more information on
the diagram.
• In correlation, on the other hand, we just want to know if two variables
are related. This is used a lot in social science research.
• The regression equation Ŷi = b0 + b1Xi is a sample estimator of the true
population regression equation, which we could build were we to take a
census:
• Yi = β0 + β1Xi + εi
where,
β0 = true Y intercept for the population
β1 = true slope for the population
εi = random error in Y for observation i
• In regression analysis, we hypothesize that there is a true regression line
for the population. The b0 and b1 coefficients are estimates of the true
population coefficients, β0 and β1.
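To make the distinction between (β0, β1) and (b0, b1) concrete, here is a small simulation sketch. The "true" values β0 = 30, β1 = 10, the error standard deviation of 5, and the sample size are all assumptions chosen for illustration. In practice the population values are unknown.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "true" population parameters (assumed for illustration only).
beta0, beta1, sigma = 30.0, 10.0, 5.0

# Draw a sample from the model Y_i = beta0 + beta1 * X_i + epsilon_i.
n = 1000
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]

# Least-squares estimates b0 and b1 computed from the sample.
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(f"b0 = {b0:.2f} (true beta0 = {beta0})")
print(f"b1 = {b1:.2f} (true beta1 = {beta1})")
```

The estimates should land close to, but not exactly on, the assumed population values. A different sample would give slightly different b0 and b1.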
Ŷi is a point on the regression line
Yi is an individual data value
• The deviations of the individual observations (the points) from the
regression line, (Yi - Ŷi), the residuals, are denoted by ei where ei = (Yi - Ŷi).
• Some deviations are positive (for the points above the line); some are
negative (for the points below the line). If a point is on the line, its
deviation = 0. Note that Σei = 0.
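The sketch below illustrates the residuals and the fact that they sum to zero. It uses the hours-studied data from the example later in this chapter and fits the line with the slope and intercept formulas from the Steps in Regression. The variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5, 6, 7]        # hours studied
ys = [5, 8, 9, 10, 11, 12, 14]    # grade on quiz
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Least-squares slope and intercept.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

# Residual e_i = Y_i - Y-hat_i for each observation.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    position = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"X = {x}: residual = {e:+.3f} ({position} the line)")

print("sum of residuals:", round(sum(residuals), 10))
```

Some residuals are positive and some negative, and their sum is zero apart from floating-point rounding.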
• Mathematically, the regression line minimizes the sum of the squared
errors (SSE):
Σei² = Σ(Yi - Ŷi)² = Σ[Yi - (b0 + b1Xi)]²
• This is why regression is called a “least squares” method and the
regression line is called the least squares line: it is the line that
minimizes the sum of squared residuals (SSE).
• Taking partial derivatives of SSE with respect to b0 and b1 and setting
them to zero, we get the “normal equations” that are used to solve for
b0 and b1.
• In regression, the levels of X are considered to be fixed. Y is the
random variable.
• In the example below, we see that most of the points are either above
the line or below the line. Only about 5 points are actually on the line
or touching it.
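The least squares idea can be checked numerically. The sketch below solves the two normal equations, written out in the comments, for the soil data used in the worked example in this chapter. It then confirms that nudging the intercept or slope away from the solution only increases SSE. The function name `sse` and the nudge sizes are illustrative assumptions.

```python
xs = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
ys = [14.1, 15.6, 16.9, 17.7, 18.3, 20.0, 21.0, 21.7, 22.6, 24.0]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Setting the partial derivatives of SSE to zero gives the normal equations:
#   sum(Y)  = n*b0      + b1*sum(X)
#   sum(XY) = b0*sum(X) + b1*sum(X^2)
# Solving them for b1 and b0:
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n


def sse(intercept, slope):
    """Sum of squared errors for the line Y-hat = intercept + slope * X."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


best = sse(b0, b1)

# Nudge the intercept and slope in each direction; every nudged line fits worse.
nudged = [
    sse(b0 + d0, b1 + d1)
    for d0 in (-0.5, 0.0, 0.5)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]
print("SSE at least-squares line:", round(best, 4))
print("smallest SSE among nudged lines:", round(min(nudged), 4))
```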
Steps in Regression
1- For Xi (independent variable) and Yi (dependent variable),
Calculate:
ΣYi
ΣXi
ΣXiYi
ΣXi²
ΣYi²
2- Calculate the correlation coefficient, r:
$$r = \frac{n\sum X_iY_i - (\sum X_i)(\sum Y_i)}{\sqrt{\left[n\sum X_i^2 - (\sum X_i)^2\right]\left[n\sum Y_i^2 - (\sum Y_i)^2\right]}}$$
-1 ≤ r ≤ 1
[This can be tested for significance. H0: ρ=0. If the correlation is not significant,
then X and Y are not related, and you really should not be doing this regression.]
3- Calculate the coefficient of determination: r² = (r)²
0 ≤ r² ≤ 1
This is the proportion of the variation in the dependent variable (Yi) explained by
the independent variable (Xi).
4- Calculate the regression coefficient b1 (the slope):
$$b_1 = \frac{n\sum X_iY_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2}$$
Note that you have already calculated the numerator and the denominator for parts
of r. Other than a single division operation, no new calculations are required.
BTW, r and b1 are related. If a correlation is negative, the slope term must be
negative; a positive slope means a positive correlation.
5- Calculate the regression coefficient b0 (the Y-intercept, or constant):
b0 = Ȳ − b1X̄
The Y-intercept (b0) is the predicted value of Y when X = 0.
6- The regression equation (a straight line) is:
Ŷi = b0 + b1Xi
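Steps 1 to 6 can be collected into one small function. This is a sketch under the formulas above, not a library routine. The name `simple_linear_regression` and the returned dictionary keys are hypothetical choices made here for illustration.

```python
import math


def simple_linear_regression(xs, ys):
    """Follow steps 1-6: return r, r^2, the slope b1 and the intercept b0."""
    n = len(xs)

    # Step 1: the five sums.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Pieces shared by r and b1.
    sxy = n * sum_xy - sum_x * sum_y
    sxx = n * sum_x2 - sum_x ** 2
    syy = n * sum_y2 - sum_y ** 2

    r = sxy / math.sqrt(sxx * syy)       # Step 2: correlation coefficient
    r2 = r ** 2                          # Step 3: coefficient of determination
    b1 = sxy / sxx                       # Step 4: slope
    b0 = sum_y / n - b1 * sum_x / n      # Step 5: intercept = Y-bar - b1 * X-bar
    return {"r": r, "r2": r2, "b1": b1, "b0": b0}


# Step 6: the regression equation for the hours/grades data from the first example.
fit = simple_linear_regression([1, 2, 3, 4, 5], [40, 50, 60, 70, 80])
print(f"Y-hat = {fit['b0']:.1f} + {fit['b1']:.1f} X")
```

For the first example this reproduces Ŷ = 30 + 10X. Note that b1 reuses the numerator of r and part of its denominator, as the note in step 4 points out.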
Example: The shear resistance of soil, y (kN/m²), is determined by
measurements as a function of the normal stress, x (kN/m²). The
data are as shown below:
x     y
10    14.1
11    15.6
12    16.9
13    17.7
14    18.3
15    20.0
16    21.0
17    21.7
18    22.6
19    24.0
1- Find the simple linear regression model.
2- State the strength and the nature of the relation between x and y.
Solution: step-1
n = 10
i      x      y       x·y      x²     y²
1      10     14.1    141      100    198.81
2      11     15.6    171.6    121    243.36
3      12     16.9    202.8    144    285.61
4      13     17.7    230.1    169    313.29
5      14     18.3    256.2    196    334.89
6      15     20.0    300      225    400
7      16     21.0    336      256    441
8      17     21.7    368.9    289    470.89
9      18     22.6    406.8    324    510.76
10     19     24.0    456      361    576
sum    145    191.9   2869.4   2185   3774.61
Σxi = 145
Σyi = 191.9
Σxi·yi = 2869.4
Σxi² = 2185
Σyi² = 3774.61
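The step-1 table can be reproduced with a few lines of Python. This is a sketch using the ten data pairs from the example. The variable names are illustrative.

```python
normal_stress = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]                         # x, kN/m^2
shear_resistance = [14.1, 15.6, 16.9, 17.7, 18.3, 20.0, 21.0, 21.7, 22.6, 24.0]  # y, kN/m^2

n = len(normal_stress)
sum_x = sum(normal_stress)
sum_y = sum(shear_resistance)
sum_xy = sum(x * y for x, y in zip(normal_stress, shear_resistance))
sum_x2 = sum(x ** 2 for x in normal_stress)
sum_y2 = sum(y ** 2 for y in shear_resistance)

# One row per observation (i, x, y, x*y, x^2, y^2), then the column totals.
for i, (x, y) in enumerate(zip(normal_stress, shear_resistance), start=1):
    print(f"{i:>3} {x:>4} {y:>6.1f} {x * y:>8.1f} {x ** 2:>5} {y ** 2:>8.2f}")
print(f"sum {sum_x:>4} {sum_y:>6.1f} {sum_xy:>8.1f} {sum_x2:>5} {sum_y2:>8.2f}")
```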
2- Calculate the correlation coefficient, r:
$$r = \frac{n\sum X_iY_i - (\sum X_i)(\sum Y_i)}{\sqrt{\left[n\sum X_i^2 - (\sum X_i)^2\right]\left[n\sum Y_i^2 - (\sum Y_i)^2\right]}} = \frac{10(2869.4) - (145)(191.9)}{\sqrt{\left[10(2185) - (145)^2\right]\left[10(3774.61) - (191.9)^2\right]}} = 0.9966$$
-1 ≤ r ≤ 1
Since r is close to +1, the relation is strong and direct (positive).
3- Calculate the coefficient of determination: r² = (r)²
0 ≤ r² ≤ 1
This is the proportion of the variation in the dependent variable (Yi) explained by
the independent variable (Xi).
r² = 0.9966² = 0.993
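The values of r and r² for the soil data can be checked from the step-1 sums alone. In this sketch the intermediate names `sxy`, `sxx` and `syy` are shorthand introduced here for the three bracketed quantities in the formula for r. They are not notation from the slides.

```python
import math

# Sums from step 1 of the soil example.
n = 10
sum_x, sum_y = 145, 191.9
sum_xy, sum_x2, sum_y2 = 2869.4, 2185, 3774.61

sxy = n * sum_xy - sum_x * sum_y      # numerator of r
sxx = n * sum_x2 - sum_x ** 2         # first bracket under the square root
syy = n * sum_y2 - sum_y ** 2         # second bracket under the square root

r = sxy / math.sqrt(sxx * syy)
r2 = r ** 2

print(f"sxy = {sxy:.1f}, sxx = {sxx}, syy = {syy:.2f}")
print(f"r = {r:.4f}, r^2 = {r2:.3f}")
```

The three pieces come out as 868.5, 825 and 920.49, so about 99.3% of the variation in shear resistance is explained by normal stress.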
4- Calculate the regression coefficient b1 (the slope):
$$b_1 = \frac{n\sum X_iY_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2} = \frac{868.5}{825} = 1.0527$$
5- Calculate the regression coefficient b0 (the Y-intercept, or constant):
$$b_0 = \bar{Y} - b_1\bar{X} = \frac{191.9}{10} - 1.0527\left(\frac{145}{10}\right) = 3.9258$$
6- The regression equation (a straight line) is:
$$\hat{Y}_i = b_0 + b_1X_i = 3.9258 + 1.0527X_i$$
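A sketch that recomputes the slope and intercept for the soil data and uses the fitted line for one prediction. The function name `predict_shear` and the choice of x = 15 are illustrative assumptions. At full precision the intercept is about 3.9255 rather than 3.9258, because the hand calculation rounds b1 to 1.0527 before computing b0.

```python
# Sums from step 1 of the soil example.
n = 10
sum_x, sum_y = 145, 191.9
sum_xy, sum_x2 = 2869.4, 2185

# Step 4: slope.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Step 5: intercept = Y-bar - b1 * X-bar.
b0 = sum_y / n - b1 * sum_x / n


def predict_shear(normal_stress):
    """Predicted shear resistance (kN/m^2) for a given normal stress (kN/m^2)."""
    return b0 + b1 * normal_stress


print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"predicted y at x = 15: {predict_shear(15):.2f}")
```

The prediction at x = 15 is about 19.7 kN/m², close to the observed value of 20.0.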
7- [OPTIONAL] Then we can test the regression for statistical significance.
There are 3 ways to do this in simple regression:
(a) t-test for correlation:
H0: ρ=0
Example: Hours Studied & Grades
Let’s examine the relationship between hours studied and grade on a quiz.
n=7 pairs of data. Highest grade on quiz is a 15.
X ≡ hours studied; Y ≡ grade on quiz.
Xi     Yi     XiYi    Xi²    Yi²
1      5      5       1      25
2      8      16      4      64
3      9      27      9      81
4      10     40      16     100
5      11     55      25     121
6      12     72      36     144
7      14     98      49     196
ΣX = 28   ΣY = 69   ΣXY = 313   ΣX² = 140   ΣY² = 731
∑X = 28
∑Y = 69
∑XY = 313
∑X² = 140
∑Y² = 731
Calculate the correlation coefficient, r:
$$r = \frac{7(313) - (28)(69)}{\sqrt{\left[7(140) - (28)^2\right]\left[7(731) - (69)^2\right]}} = \frac{259}{\sqrt{(196)(356)}} = \frac{259}{264.2} = 0.98$$
Calculate the coefficient of determination, R²:
R² = (0.98)² = 0.9604
Calculate the regression coefficient b1 (the slope):
$$b_1 = \frac{7(313) - (28)(69)}{7(140) - (28)^2} = \frac{259}{196} = 1.32$$
Calculate the regression coefficient b0 (the Y-intercept, or
constant):
b0 = (69/7) - 1.32(28/7) = 4.58
• The regression equation:
Ŷi = 4.58 + 1.32Xi
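The hours-studied example can be verified in the same way. This sketch works from the raw data and keeps full precision throughout, so two values differ slightly from the hand calculation: R² is about 0.961 (the slide squares the rounded r = 0.98), and b0 is about 4.57 (the slide uses the rounded slope 1.32). The variable names are illustrative.

```python
import math

hours = [1, 2, 3, 4, 5, 6, 7]
grades = [5, 8, 9, 10, 11, 12, 14]
n = len(hours)

sum_x, sum_y = sum(hours), sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)

sxy = n * sum_xy - sum_x * sum_y      # 7(313) - (28)(69)
sxx = n * sum_x2 - sum_x ** 2         # 7(140) - (28)^2
syy = n * sum_y2 - sum_y ** 2         # 7(731) - (69)^2

r = sxy / math.sqrt(sxx * syy)
r2 = r ** 2
b1 = sxy / sxx
b0 = sum_y / n - b1 * sum_x / n

print(f"r = {r:.2f}, R^2 = {r2:.3f}")
print(f"Y-hat = {b0:.2f} + {b1:.2f} X")
```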