Chapter 4 - Notes
➢ Perfect negative correlation occurs when all the plotted points lie
exactly on the fitted line with a negative gradient. In this case 𝑟 = −1.
➢ No linear relationship occurs when the plotted points show no linear pattern. In this case 𝑟 ≈ 0.
Coefficient of Determination, 𝑟²
The coefficient of determination is calculated by squaring the correlation coefficient, i.e. 𝑟².
Interpretation of the coefficient of determination:
𝑟² represents the proportion of the variability in the response variable (𝑦) that is explained by its linear relationship with the explanatory variable (𝑥). The coefficient of determination measures how well a statistical model predicts an outcome (the outcome is represented by the model's dependent variable). The lowest possible value of 𝑟² is 0 and the highest possible value is 1. Put simply, the better a model is at making predictions, the closer its 𝑟² will be to 1.
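For illustration, a minimal Python sketch (the data values are hypothetical) computing 𝑟 and 𝑟² with numpy:

```python
import numpy as np

# Hypothetical (x, y) sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
r2 = r ** 2                  # coefficient of determination

print(f"r = {r:.4f}, r^2 = {r2:.4f}")
```

Here 𝑟 is close to 1, so nearly all of the variability in 𝑦 is explained by its linear relationship with 𝑥.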
➢ N.B. it is important to view the relationship between two continuous
variables using a scatter plot before any statistical analysis is done.
➢ This assists in determining whether the relationship is linear. If it is not
linear, then we cannot use 𝑟 and 𝑟 2 to describe the relationship.
Types of relationships that may exist:
[Figure: examples of the types of relationships between two continuous variables]
Relationships between continuous variables — this is the type of relationship we will be considering…
The Simple Linear Regression Model
The correlation coefficient, 𝑟, is used to quantify the linear relationship
between continuous variables.
In this section, we use a simple linear regression model to define the
linear relationship between a response variable, 𝑌𝑖 , and a predictor
variable, 𝑥ᵢ:
𝑌ᵢ = β₀ + β₁𝑥ᵢ + εᵢ,  𝑖 = 1, 2, …, 𝑁 (the population size)
where: β₀ is the y-intercept (constant term)
β₁ is the slope/gradient of the line
εᵢ is an error term, where it is assumed that εᵢ ~ 𝑁(0, σ²) and that these error terms are independent of each other.
This model represents the true relationship between 𝑥 and 𝑌 (for the population).
• 𝛽1 corresponds to the magnitude of change in the response variable
given a one unit change in the predictor variable.
• 𝛽0 and 𝛽1 are unknown population parameters.
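To make the roles of β₀, β₁ and εᵢ concrete, here is a minimal Python sketch (all parameter values are hypothetical) that simulates observations from this population model:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population parameters
beta0, beta1, sigma = 6.5, 2.6, 9.5

# Y_i = beta0 + beta1 * x_i + eps_i, with eps_i ~ N(0, sigma^2), independent
N = 1000
x = rng.uniform(0, 30, size=N)
eps = rng.normal(loc=0.0, scale=sigma, size=N)
Y = beta0 + beta1 * x + eps
```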
Assumptions of Linear Regression
• Similar to the ANOVA statistical model from chapter 3, the statistical
model used in linear regression also has assumptions that need to be
met before one is able to use it:
L I N E
✓ L — Linear: the most obvious assumption is that there is a linear relationship between the response and the independent/predictor variable.
✓ N — Normal: the random error term in the model, εᵢ, is assumed to have a normal distribution with a mean of zero.
✓ E — Equal variance: the random error term, εᵢ, is assumed to have a constant variance σ², i.e. εᵢ ~ 𝑁(0, σ²).
✓ I — Independent: the errors are independent.
• We can use graphical techniques and statistical tests to determine if
these assumptions are met.
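For instance, a minimal Python sketch (with simulated, hypothetical data) of a residuals-versus-fitted-values plot, a common graphical check of the linearity and constant-variance assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)

# Hypothetical data generated to satisfy the model assumptions
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.5 * x + rng.normal(0, 2, size=50)

# Fit the least squares line and compute residuals
b1, b0 = np.polyfit(x, y, deg=1)  # slope first, then intercept
fitted = b0 + b1 * x
residuals = y - fitted

# Residuals should scatter randomly around zero with constant spread
plt.scatter(fitted, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```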
Estimating the Regression Line
• As the goal in simple linear regression is usually to characterize the
relationship between the response and predictor variables in the
population, a sample of data is used.
• From this sample, the unknown population parameters (𝛽0 , 𝛽1 ) that
define the assumed relationship between the response and predictor
variables are estimated.
• Estimates of the unknown population parameters 𝛽0 and 𝛽1 are
obtained by the method of least squares.
• This method provides the estimates by determining the line that
minimizes the sum of the squared vertical distances between the
observations and the fitted line.
• In other words, the fitted or regression line is as close as possible to all
the observed data points.
For example:
[Figure: scatter plot of observed data with the fitted line; blue vertical lines show the vertical distances between the observations and the line]
The estimated regression line (also known as the line of best fit) is determined by minimizing the sum of the squared differences shown by the blue vertical lines.
The line of best fit/least squares regression line (also known as the estimated regression line) is given by:
Ŷᵢ = β̂₀ + β̂₁𝑥ᵢ,  𝑖 = 1, 2, …, 𝑛
where: Ŷᵢ is the fitted/estimated 𝑦-value
β̂₀ is the estimated y-intercept (i.e. a sample estimate for β₀)
β̂₁ is the estimated gradient/slope (i.e. a sample estimate for β₁)
𝑥ᵢ is the value of the independent variable
𝑛 is the sample size, i.e. 𝑛 (𝑥, 𝑦) coordinates have been sampled.
These estimates are calculated by:
(1) β̂₁ = Sxy / Sxx
(2) β̂₀ = ȳ − β̂₁x̄
where Sxy = Σxy − (Σx)(Σy)/n and Sxx = Σx² − (Σx)²/n.
• The estimated gradient/slope β̂₁ is calculated first using (1); then, using the result for β̂₁, the estimated y-intercept β̂₀ is determined.
• To determine β̂₀, the mean of the observed 𝑦-values (given by ȳ = Σy/n) and the mean of the observed 𝑥-values (given by x̄ = Σx/n) are substituted into formula (2) above.
Recall: β̂₀ and β̂₁ can be calculated using STAT mode on the calculator, similar to finding the correlation coefficient, 𝒓. Using the notation on the calculator, A = β̂₀ and B = β̂₁.
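A minimal Python sketch (with a hypothetical sample) that applies formulas (1) and (2) directly:

```python
import numpy as np

# Hypothetical sample of n (x, y) coordinates
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.2, 11.9, 14.4, 17.3, 19.6, 22.5])
n = len(x)

Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n

b1 = Sxy / Sxx                     # (1) estimated slope
b0 = np.mean(y) - b1 * np.mean(x)  # (2) estimated y-intercept

print(f"fitted line: y_hat = {b0:.2f} + {b1:.2f} x")
```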
Hypothesis Testing in Simple Linear Models
• The estimates 𝛽መ0 and 𝛽መ1 depend on the sample that is drawn. If
another sample is obtained, it would lead to different values for 𝛽መ0
and 𝛽መ1 .
• Therefore, in this section, we make inferences about the real/true
values of 𝛽0 and 𝛽1 , thus we perform hypothesis tests for each of them.
For 𝜷₀ (the y-intercept of the line):
Null hypothesis: H₀: β₀ = β₀₀, where β₀₀ is a specific value of the parameter, based on the value in the claim. (The additional ‘0’ in the subscript distinguishes the hypothesized value β₀₀ from the parameter β₀.)
Test statistic:
T = (β̂₀ − β₀₀) / [ Se √(1/n + x̄²/Sxx) ] ~ t(n − 2)
where:
β̂₀ is the sample estimate (the y-intercept of the estimated regression line)
β₀₀ is the specified value under the null hypothesis
Se = √[ (Syy − Sxy²/Sxx) / (n − 2) ] = √[ (Syy − β̂₁Sxy) / (n − 2) ]
Sxy = Σxy − (Σx)(Σy)/n,  Sxx = Σx² − (Σx)²/n,  Syy = Σy² − (Σy)²/n
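A minimal Python sketch of this t-test for β₀, using the summary quantities from the worked example later in this section and testing the hypothetical claim H₀: β₀ = 0; scipy is assumed to be available for the critical value:

```python
import numpy as np
from scipy import stats

# Summary quantities (from the worked example below)
n = 12
x_bar, y_bar = 178.8 / 12, 545 / 12
Sxx, Syy, Sxy = 732.8, 5904.42, 1912.39

b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar
Se = np.sqrt((Syy - b1 * Sxy) / (n - 2))

beta00 = 0.0  # hypothesized value under H0 (hypothetical claim)
T = (b0 - beta00) / (Se * np.sqrt(1 / n + x_bar ** 2 / Sxx))

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # two-tailed, alpha = 0.05
print(f"T = {T:.2f}; reject H0 if |T| > {t_crit:.3f}")
```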
For 𝜷₁ (the slope/gradient of the line):
Null hypothesis: H₀: β₁ = β₁₀, where β₁₀ is a specific value of the parameter, based on the value in the claim.
Test statistic:
T* = (β̂₁ − β₁₀) √Sxx / Se ~ t(n − 2)
Rejection Rule: if the calculated test statistic lies in the rejection region, reject H₀ in favour of H₁. I.e.
• Left-tailed test (the area of the rejection region = α): for H₁: β₀ < β₀₀, reject H₀ if T < −t(n−2; α); for H₁: β₁ < β₁₀, reject H₀ if T* < −t(n−2; α).
• Right-tailed test (the area of the rejection region = α): for H₁: β₀ > β₀₀, reject H₀ if T > t(n−2; α); for H₁: β₁ > β₁₀, reject H₀ if T* > t(n−2; α).
• Two-tailed test: for H₁: β₀ ≠ β₀₀, reject H₀ if T < −t(n−2; α/2) or T > t(n−2; α/2); similarly for H₁: β₁ ≠ β₁₀ using T*.
Example (book sales vs advertising budget):
The observed numbers of copies sold were:
𝑦: 12.5  18.6  25.3  24.8  35.7  45.4  44.4  45.8  65.3  75.7  72.3  79.2
Summary statistics: n = 12, Σx = 178.8, Σy = 545, Σx² = 3396.92, Σy² = 30656.5, Sxy = 1912.39, and the fitted regression line is ŷᵢ = 6.53 + 2.61xᵢ.
1) Test, at the 5% level of significance, the claim that the slope differs from 2, i.e. H₁: β₁ ≠ 2.
2) Approximately how many copies of the book would be sold if R20 000 was the advertising budget?

For 1):
∴ Sxx = Σx² − (Σx)²/n = 3396.92 − 178.8²/12 = 732.8
∴ Syy = Σy² − (Σy)²/n = 30656.5 − 545²/12 = 5904.42
Se = √[ (Syy − β̂₁Sxy) / (n − 2) ] = √[ (5904.42 − 2.61(1912.39)) / (12 − 2) ] = 9.56
(Since we have already obtained the regression line, use the second formula for Se.)
∴ T* = (β̂₁ − β₁₀) √Sxx / Se = (2.61 − 2) √732.8 / 9.56 = 1.73
Since H₁: β₁ ≠ 2 (two-tailed) and α = 0.05, α/2 = 0.025, so −t(n−2; α/2) = −t(10; 0.025) = −2.228.
Rejection Rule: reject the null hypothesis if T* < −2.228 or if T* > 2.228.
Decision: since −2.228 < T* = 1.73 < 2.228, do not reject H₀ at a 5% l.o.s. There is insufficient evidence that the slope differs from 2.
Confidence intervals (at the 90% level, α = 0.10):
For the y-intercept (β₀):
β̂₀ ± t(n−2; α/2) · Se · √(1/n + x̄²/Sxx) = 6.53 ± 1.812 × 9.56 × √(1/12 + 14.9²/732.8)
= (−4.24 ; 17.30)
For the gradient/slope (β₁):
β̂₁ = 2.61, Se = 9.56, Sxx = 732.8, and t(n−2; α/2) = t(10; 0.05) = 1.812
β̂₁ ± t(n−2; α/2) · Se / √Sxx = 2.61 ± 1.812 × 9.56 / √732.8 = (1.97 ; 3.25)
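A minimal Python sketch that reproduces the numbers in this worked example (scipy assumed available; for question 2, taking x = 20 assumes the advertising budget is measured in thousands of rand):

```python
import numpy as np
from scipy import stats

n = 12
Sxx, Syy, Sxy = 732.8, 5904.42, 1912.39
b1 = Sxy / Sxx                             # 2.61
b0 = 545 / 12 - b1 * (178.8 / 12)          # 6.53
Se = np.sqrt((Syy - b1 * Sxy) / (n - 2))   # 9.56

# 1) Two-tailed test of H0: beta1 = 2 at alpha = 0.05
T_star = (b1 - 2) * np.sqrt(Sxx) / Se
t_crit = stats.t.ppf(1 - 0.025, df=n - 2)  # 2.228
print(f"T* = {T_star:.2f}, critical value = {t_crit:.3f}")

# 90% confidence interval for the slope
t90 = stats.t.ppf(1 - 0.05, df=n - 2)      # 1.812
half = t90 * Se / np.sqrt(Sxx)
print(f"90% CI for beta1: ({b1 - half:.2f}, {b1 + half:.2f})")

# 2) Predicted sales at a budget of R20 000 (assuming x is in R1 000s)
print(f"y_hat at x = 20: {b0 + b1 * 20:.2f}")
```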
Note: additional content has been added to this section for 2020. Thus, the extra
examples and past papers do not cover this whole section.
Multiple Linear Regression
With p explanatory variables the model becomes:
𝑌ᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ⋯ + βₚxₚᵢ + εᵢ
A 3D example (2 explanatory variables) looks like:
[Figure: a fitted regression plane for Y over the X1–X2 plane]
Picturing the model: if there is no relationship among 𝑌 and 𝑋₁ and 𝑋₂, the model is a horizontal plane passing through the point (𝑌 = β₀, 𝑋₁ = 0, 𝑋₂ = 0).
[Figure: a horizontal plane at height β₀ over the X1–X2 plane]
Prediction:
• If the primary interest of fitting a multiple regression model is
prediction, then the terms in the model, the values of their
coefficients, and their statistical significance are of secondary
importance.
• The focus is on producing a model that is the best at predicting future values of 𝑌 as a function of the 𝑋s. The predicted/estimated value of 𝑌 is given by:
Ŷ = β̂₀ + β̂₁𝑋₁ + β̂₂𝑋₂ + ⋯ + β̂ₚ𝑋ₚ
If the estimated linear regression model does not fit the data better than the baseline model, you fail to reject the null hypothesis.
✓ Thus, you do not have enough evidence to say that any of the regression coefficients differ from zero. The predictor variables do not explain a significant amount of variability in the response variable, thus they do not have a significant effect on the response.
If the estimated linear regression model does fit the data better than the
baseline model, you reject the null hypothesis.
✓ Thus, you do have enough evidence to say that at least one of the
regression parameters differs from zero. At least one predictor variable
explains a significant amount of variability in the response variable,
thus at least one has a significant effect on the response.
ANOVA for Multiple Linear Regression
• An Analysis of Variance (ANOVA) can be used to determine if the
estimated multiple regression model fits the data well or not, i.e. to test
the following hypotheses:
H₀: β₁ = β₂ = ⋯ = βₚ = 0 vs H₁: at least one βⱼ ≠ 0
Degrees of freedom: n − 1 = p + (n − 1 − p)
• 𝑌ᵢ = the values of the raw responses
• Ȳ = the average of the responses
• Ŷᵢ = the estimated/predicted value of a response (based on the fitted regression model)
• n = the sample size
• p = the number of predictor variables in the fitted regression line (this is equal to the number of regression coefficients βⱼ, j = 1, …, p)
Source of Variation | Sum of Squares | df        | Mean sum of squares     | F-statistic
Regression/model    | SSR            | p         | MSR = SSR / p           | F_calc = MSR / MSE
Error               | SSE            | n − 1 − p | MSE = SSE / (n − 1 − p) |
Total               | SSTotal        | n − 1     |                         |

Note: SSTotal = (n − 1) × Var(Y)
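A minimal Python sketch (with simulated, hypothetical data) that builds this ANOVA table for a fitted model with p = 2 predictors:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical data: n observations, p = 2 predictor variables
n, p = 50, 2
X = rng.normal(size=(n, p))
y = 5 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Least squares fit (design matrix with an intercept column)
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

SSTotal = np.sum((y - y.mean()) ** 2)  # = (n - 1) * Var(Y)
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares
SSR = SSTotal - SSE                    # regression/model sum of squares

MSR, MSE = SSR / p, SSE / (n - 1 - p)
F_calc = MSR / MSE
print(f"SSR = {SSR:.2f}, SSE = {SSE:.2f}, F = {F_calc:.2f}")
```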
For each individual coefficient, the reported P-value tests H₀: βⱼ = 0 vs H₁: βⱼ ≠ 0.
Example:
Suppose we wish to assess the effect of BMI, age and waist
circumference on the systolic blood pressure of a person using a sample
of 250 observations.
The following results were obtained for the coefficients of each predictor variable:

Variable      | Estimate | P-value
Intercept     | 68.15    | 0.0001
BMI           | −1.23    | 0.0001
Age           | 0.65     | 0.0024
Waist Circum. | −0.58    | 0.1381
a) Determine the estimated regression line.
Ŷᵢ = 68.15 − 1.23x₁ᵢ + 0.65x₂ᵢ − 0.58x₃ᵢ
where:
Ŷᵢ is the estimated systolic blood pressure for the 𝑖ᵗʰ person, 𝑖 = 1, …, 250 (n = 250, the sample size)
x₁ᵢ corresponds to the value of the predictor variable BMI
x₂ᵢ corresponds to the value of the predictor variable Age
x₃ᵢ corresponds to the value of the predictor variable Waist circumference
• The P-values for BMI and age are significant (both are less than 5%). Therefore, for each of these variables, the null hypothesis that the corresponding coefficient equals 0 is rejected. Thus, these variables have a significant effect on the response variable.
• The P-value for waist circumference = 0.1381 > 0.05, therefore the
coefficient corresponding to this variable is not significantly different
from 0 (as 𝐻0 : 𝛽3 = 0 is not rejected). Thus, this variable does not have
a significant effect on the systolic blood pressure of an individual.
d) Interpret the results of the estimates for BMI and age.
BMI:
For a one unit increase in BMI, there is a 1.23 unit decrease in systolic blood pressure (holding the other variables constant). Thus, BMI has a negative (decreasing) effect on systolic blood pressure, as its estimate is negative.
Age:
For a one unit increase in age, there is a 0.65 unit increase in systolic blood pressure (holding the other variables constant). Thus, age has a positive (increasing) effect on systolic blood pressure, as its estimate is positive.
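A minimal Python sketch using the estimated equation from part a) to predict systolic blood pressure; the input values for BMI, age and waist circumference are hypothetical:

```python
# Estimated coefficients from the example output
b0, b_bmi, b_age, b_waist = 68.15, -1.23, 0.65, -0.58

def predict_sbp(bmi: float, age: float, waist: float) -> float:
    """Predicted systolic blood pressure from the fitted model."""
    return b0 + b_bmi * bmi + b_age * age + b_waist * waist

# Hypothetical individual: BMI 27, age 45, waist circumference 90
print(f"predicted SBP: {predict_sbp(27, 45, 90):.2f}")
```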
Some notes on multiple linear regression:
• You can also explore possible interaction effects between two or
more of the explanatory/predictor variables.
• You can include the effects of qualitative/categorical variables in addition to those of the continuous explanatory variables in the same model. This model is known as an ANCOVA (analysis of covariance) model.