STQT401
SIMPLE LINEAR REGRESSION
AND CORRELATION
Regression analysis
The nature of the relationship between two variables can take many forms, from a simple straight line to extremely complicated mathematical functions
Regression analysis is used primarily for the purpose of prediction
The simplest relationship is a straight-line, or linear, relationship
Regression develops a statistical model that can be used to predict the values of a dependent
or response variable from the values of at least one explanatory or independent variable
Dependent variable
• The variable we wish to predict or explain
Independent variable
• The variable used to predict or explain the dependent variable
Scatter diagram
• Visualises the relationship between the variables (independent variable on the horizontal X axis, dependent variable on the vertical Y axis)
• Helps suggest a starting point for regression analysis
Positive straight-line relationship
[Figure: a straight line Y = a + bX; a is the intercept on the Y axis at X = 0, and the slope is ∆Y/∆X, the change in Y per unit change in X. X – independent variable, Y – dependent variable.]
Simple Linear regression model
Only one independent variable (X)
Relationship between X and Y is described by a linear function
Changes in Y are assumed to be related to changes in X
Yc = a + bX
Yc = predicted value of Y for the observation (at the given value of X)
a = Y intercept
b = slope (average change in Yc for each change of 1 unit in X)
X = independent variable
Types of relationships
• Positive linear
• Negative linear
• Weak linear relationship
• Curvilinear relationship
• No relationship
Correlation Analysis
Used to measure the strength of the association between variables
The objective is not to use one variable to predict another, but rather to measure the
strength of the association or covariation that exists between two continuous
variables
Coefficient of Correlation (r): measured on a scale from −1 to +1.
When r = 0, there is no linear relationship.
As r → +1, there is a strong positive relationship between the variables, i.e., as X
increases, Y also increases.
As r → −1, there is a strong negative relationship between the variables, i.e., as X
increases, Y decreases, and vice versa.
Association
[Figure: three scatter plots illustrating perfect positive correlation, perfect negative correlation and no correlation.]
• Perfect positive correlation: when one variable changes, the other changes in the same direction; Y increases in a perfectly predictable manner as X increases.
• Perfect negative correlation: when one variable changes, the other changes in the opposite direction; Y decreases in a perfectly predictable manner as X increases.
• No correlation: there is no relationship between the variables; as X increases there is no systematic change in Y, so there is no association between the values of X and the values of Y.
Association
[Figure: six example scatter plots with correlation coefficients r = 0.00, −0.72, 0.50, −0.96, 0.98 and −0.45.]
Procedure for Calculation
1. Collect the data for both the dependent (Y) and independent (X) variables.
2. Arrange the data in two columns, X and Y.
3. Compute Pearson's coefficient of correlation:
   r = (n·∑xy − ∑x·∑y) / √[(n·∑x² − (∑x)²) · (n·∑y² − (∑y)²)]
   Coefficient of determination = r²
4. Determine the values of a and b:
   b = (n·∑xy − ∑x·∑y) / (n·∑x² − (∑x)²)
   a = Ȳ − b·X̄, where Ȳ = ∑Y/n and X̄ = ∑X/n
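The steps above can be sketched as a small Python function built directly from the summation formulas (the function name `fit_line` is illustrative, not from the slides):

```python
# Sketch of steps 1-4: Pearson's r, slope b, and intercept a
# computed from the summation formulas.
from math import sqrt

def fit_line(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))
    b = (n * sxy - sx * sy) / (n * sxx - sx**2)
    a = sy / n - b * sx / n   # a = Y-bar minus b times X-bar
    return r, a, b

# Perfectly linear toy data: y = 2x exactly, so r = 1, a = 0, b = 2.
r, a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(r, a, b)  # 1.0 0.0 2.0
```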
Example
An engineer wishes to examine the relationship between the length of steel bars (cm) and their
respective weight (kg). A random sample of 10 steel bars is selected.
Length of steel bars (cm)   Weight (kg)
1.40                        2.45
1.60                        3.12
1.70                        2.79
1.88                        3.08
1.10                        1.99
1.55                        2.19
2.35                        4.05
2.45                        3.24
1.43                        3.19
1.70                        2.55
a) Compute the coefficients of correlation and determination and interpret your answers.
b) Determine the regression equation and estimate the weight of a 3 cm steel bar.
c) Test the hypothesis that the coefficient of correlation in the population is zero. Use α = 0.05.
Scatter Plot
[Figure: scatter plot of weight (kg) against length (cm) for the 10 steel bars.]
Computing the values
X – Length (cm)   Y – Weight (kg)   x²       y²       x·y
1.40              2.45              1.96     6.00     3.43
1.60              3.12              2.56     9.73     4.99
1.70              2.79              2.89     7.78     4.74
1.88              3.08              3.53     9.49     5.79
1.10              1.99              1.21     3.96     2.19
1.55              2.19              2.40     4.80     3.39
2.35              4.05              5.52     16.40    9.52
2.45              3.24              6.00     10.50    7.94
1.43              3.19              2.04     10.18    4.56
1.70              2.55              2.89     6.50     4.34
Totals:  17.16    28.65             31.02    85.34    50.89
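The summary row of the table can be verified in Python; this sketch recomputes the five totals from the raw sample:

```python
# Recomputing the column totals of the table above from the raw data.
lengths = [1.40, 1.60, 1.70, 1.88, 1.10, 1.55, 2.35, 2.45, 1.43, 1.70]  # X
weights = [2.45, 3.12, 2.79, 3.08, 1.99, 2.19, 4.05, 3.24, 3.19, 2.55]  # Y

sum_x  = sum(lengths)
sum_y  = sum(weights)
sum_x2 = sum(x * x for x in lengths)
sum_y2 = sum(y * y for y in weights)
sum_xy = sum(x * y for x, y in zip(lengths, weights))

print(round(sum_x, 2), round(sum_y, 2))                       # 17.16 28.65
print(round(sum_x2, 2), round(sum_y2, 2), round(sum_xy, 2))   # 31.02 85.34 50.89
```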
Computing r and b
r = (n·∑xy − ∑x·∑y) / √[(n·∑x² − (∑x)²) · (n·∑y² − (∑y)²)]
r = (10 × 50.89 − 17.16 × 28.65) / √[(10 × 31.02 − 17.16²) × (10 × 85.34 − 28.65²)] = 0.764
There is a moderately strong positive relationship between the length of the steel bars and their weight.
r² = 0.764² = 0.584
58.4% of the variation in the weight of the steel bars can be explained by the variability in their length.
b = (n·∑xy − ∑x·∑y) / (n·∑x² − (∑x)²)
b = (10 × 50.89 − 17.16 × 28.65) / (10 × 31.02 − 17.16²) = 1.10
Computing a
Ȳ = ∑Y / n = 28.65 / 10 = 2.865
X̄ = ∑X / n = 17.16 / 10 = 1.716
a = Ȳ − b·X̄ = 2.865 − (1.10 × 1.716) = 0.977
Graphical Presentation
[Figure: scatter plot of weight (kg) against length (cm) with the fitted regression line; intercept a = 0.977, slope b = 1.1.]
Y(Weight) = a + b·X(Length), i.e. Y = 0.977 + 1.1X
Predicting the dependent variable
Predict the weight of a steel bar that has a length of 3 cm.
Y = 0.977 + 1.1X
Y = 0.977 + (1.10 × 3) = 4.28
The predicted weight of a 3 cm steel bar is 4.28 kg.
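The prediction is a direct evaluation of the fitted equation; as a short sketch (the function name `predict_weight` is illustrative):

```python
# Evaluating the fitted line Y = 0.977 + 1.1*X at X = 3 cm.
a, b = 0.977, 1.1

def predict_weight(length_cm):
    return a + b * length_cm

print(round(predict_weight(3), 2))  # 4.28
```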
Hypothesis Testing (correlation coefficient r for linear regression)
State the null hypothesis (H0)
• H0: ρ = 0
State the alternative hypothesis (Ha)
• Ha: ρ ≠ 0
Specify the level of significance to be used for the t test
• Common α values are 0.01, 0.02, 0.05 and 0.10.
Critical value tc
• tc = t(α/2, degrees of freedom = n − 2)
Decision rule:
• Accept H0 if −tc < t < tc
t test statistic:
t = (r − ρ) / √[(1 − r²) / (n − 2)]
• r – sample correlation coefficient
• n – sample size
• ρ – population correlation coefficient
State the decision.
Example
Test the hypothesis that the coefficient of correlation in the population is zero. Use α = 0.05.
H0: ρ = 0 (there is no correlation)
Ha: ρ ≠ 0 (there is correlation)
Two-tailed t test, α = 0.05, df = n − 2 = 10 − 2 = 8
Decision rule: Accept H0 if −2.306 < t < 2.306
Test statistic: t = (r − ρ) / √[(1 − r²)/(n − 2)] = (0.764 − 0) / √[(1 − 0.584)/(10 − 2)] = 3.35
Decision: since t = 3.35 falls in the region of rejection, H0 is rejected.
Conclusion: there is a relationship between the variables.
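The test can be reproduced numerically; a minimal sketch (the critical value 2.306 is taken from the example rather than computed, and rounding may differ slightly from the hand calculation):

```python
# Reproducing the t test for H0: rho = 0 with r = 0.764, n = 10.
from math import sqrt

r, n = 0.764, 10
t = (r - 0) / sqrt((1 - r**2) / (n - 2))
t_crit = 2.306  # two-tailed critical value t(0.025, df = 8), from the example

print(round(t, 2))      # 3.35
print(abs(t) > t_crit)  # True -> reject H0
```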