Chapter 4 - Notes

This document outlines the concepts of correlation and simple linear regression, focusing on the relationship between two continuous variables. It covers the assumptions of linear regression, the estimation of regression parameters, and hypothesis testing for the intercept and slope of the regression line. Additionally, it explains the calculation of the correlation coefficient and the coefficient of determination, emphasizing the importance of graphical analysis before statistical evaluation.

Outline

▪ Section 4.1: STAT130 Recap of bivariate data


o Linear correlation
o Simple linear regression

▪ Section 4.2: Simple linear regression inference


o Assumptions for linear regression
o Hypothesis tests for the true value of the intercept and gradient of the line
o Confidence intervals for the true value of the intercept and gradient of the line

Introduction
➢ Recall from the previous chapter: when you have a categorical
predictor variable/factor and a continuous outcome (response)
variable, you use ANOVA to analyze your data.
➢ In this chapter, we are interested in exploring the relationship
between two continuous variables, particularly the linear relationship
between them. This can be done by correlation and regression
analysis.
➢ In this case, we have a continuous independent variable (𝑥) and
continuous dependent variable (𝑦).
➢ 𝑥 is also known as an explanatory or predictor variable, and 𝑦 a
response variable.
Correlation (STAT130 Recap)
➢ Recall from STAT130: the correlation coefficient (𝑟) is used to measure
the strength of the linear relationship between two continuous
variables.
➢ This value of 𝑟 is an estimate of the true correlation between the
values in the population, and it is calculated based on the sample
collected for each variable.
➢ The sign of 𝑟 (either positive or negative) determines the type of linear
relationship between the two variables:
✓ When 𝑟 > 0, there is a positive linear relationship.
✓ When 𝑟 < 0, there is a negative linear relationship.
➢ Recall: STAT mode on your calculator can be used to find 𝑟.
Correlation (STAT130 Recap)…
➢ Perfect positive correlation occurs when all the plotted points lie
exactly on the fitted line with a positive gradient. In this case 𝑟 = 1.

[Scatter plot: strong, positive linear relationship, 𝑟 ≈ 1]

➢ Perfect negative correlation occurs when all the plotted points lie
exactly on the fitted line with a negative gradient. In this case 𝑟 = −1.

[Scatter plot: strong, negative linear relationship, 𝑟 ≈ −1]

[Scatter plot: no linear relationship, 𝑟 ≈ 0]
Coefficient of Determination (𝑟²)
The coefficient of determination is calculated by squaring the
correlation coefficient, i.e. 𝑟².
Interpretation of the coefficient of determination:
𝑟² represents the proportion of the variability in the response variable (𝑦)
that is explained by its linear relationship with the explanatory variable (𝑥).
The coefficient of determination also measures how well a statistical model
predicts an outcome (the outcome is represented by the model’s dependent
variable). The lowest possible value of 𝑟² is 0 and the highest possible value
is 1. Put simply, the better a model is at making predictions, the closer its
𝑟² will be to 1.
➢ N.B. it is important to view the relationship between two continuous
variables using a scatter plot before any statistical analysis is done.
➢ This assists in determining whether the relationship is linear. If it is not
linear, then we cannot use 𝑟 and 𝑟² to describe the relationship.
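A small sketch of how 𝑟 and 𝑟² might be computed in software (assuming Python with numpy; the data values below are purely hypothetical, and the calculator's STAT mode gives the same results):

```python
# Illustrative sketch: computing the correlation coefficient r and the
# coefficient of determination r^2 from paired data with numpy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical response values

r = np.corrcoef(x, y)[0, 1]   # sample correlation coefficient
r_squared = r ** 2            # coefficient of determination

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```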
Types of relationships that may exist:
Relationships between Continuous Variables
[Figures showing different types of relationships between continuous
variables; the linear relationship is the type we will be considering.]
The Simple Linear Regression Model
The correlation coefficient, 𝑟, is used to quantify the linear relationship
between continuous variables.
In this section, we use a simple linear regression model to define the
linear relationship between a response variable, 𝑌ᵢ, and a predictor
variable, 𝑥ᵢ:

𝑌ᵢ = 𝛽₀ + 𝛽₁𝑥ᵢ + 𝜀ᵢ,   𝑖 = 1, 2, …, 𝑁 (the population size)

This model represents the true relationship between 𝑥 and 𝑌 (for the
population), where
𝛽₀ is the y-intercept (constant term),
𝛽₁ is the slope/gradient of the line, and
𝜀ᵢ is an error term, where it is assumed that 𝜀ᵢ ~ 𝑁(0, 𝜎²) and that
these error terms are independent of each other.
• 𝛽₁ corresponds to the magnitude of change in the response variable
given a one unit change in the predictor variable.
• 𝛽₀ and 𝛽₁ are unknown population parameters.
Assumptions of Linear Regression
• Similar to the ANOVA statistical model from chapter 3, the statistical
model used in linear regression also has assumptions that need to be
met before one is able to use it:
L I N E
✓ The most obvious assumption is that there is a linear relationship
between the response and independent/predictor variable. (L)
✓ The random error term in the model, 𝜀ᵢ, is assumed to have a
normal distribution with a mean of zero. (N)
✓ The random error term in the model, 𝜀ᵢ, is assumed to have a
constant variance, 𝜎². I.e. 𝜀ᵢ ~ 𝑁(0, 𝜎²). (E)
✓ The errors are independent. (I)
• We can use graphical techniques and statistical tests to determine if
these assumptions are met.
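A rough sketch of such graphical checks, assuming Python with numpy and matplotlib (the data used here are the first six (𝑥, 𝑦) pairs from the worked example later in this chapter, purely for illustration):

```python
# Illustrative sketch: graphical checks of the regression assumptions.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([8, 9.5, 7.2, 6.5, 10, 12])             # sample x values
y = np.array([12.5, 18.6, 25.3, 24.8, 35.7, 45.4])   # sample y values

b1, b0 = np.polyfit(x, y, 1)        # least squares slope and intercept
residuals = y - (b0 + b1 * x)       # observed minus fitted values

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(x, y)                       # linearity: points should follow a line
axes[0].set_title("y vs x")
axes[1].scatter(b0 + b1 * x, residuals)     # constant variance: no funnel shape
axes[1].axhline(0, linestyle="--")
axes[1].set_title("Residuals vs fitted values")
plt.show()
```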
Estimating the Regression Line
• As the goal in simple linear regression is usually to characterize the
relationship between the response and predictor variables in the
population, a sample of data is used.
• From this sample, the unknown population parameters (𝛽0 , 𝛽1 ) that
define the assumed relationship between the response and predictor
variables are estimated.
• Estimates of the unknown population parameters 𝛽0 and 𝛽1 are
obtained by the method of least squares.
• This method provides the estimates by determining the line that
minimizes the sum of the squared vertical distances between the
observations and the fitted line.
• In other words, the fitted or regression line is as close as possible to all
the observed data points.
For example:

[Figure: scatter plot of observed points with the fitted line and blue
vertical lines from each observation to the line]

The estimated regression line (also known as the line of best fit) is
determined by minimizing the sum of the squared differences shown
by the blue vertical lines above.
The line of best fit/least squares regression line (also known as the
estimated regression line) is given by:

Ŷᵢ = β̂₀ + β̂₁𝑥ᵢ,   𝑖 = 1, 2, …, 𝑛

where: Ŷᵢ is the fitted/estimated 𝑦-value
β̂₀ is the estimated y-intercept (i.e. a sample estimate for 𝛽₀)
β̂₁ is the estimated gradient/slope (i.e. a sample estimate for 𝛽₁)
𝑥ᵢ is the value of the independent variable.
𝑛 is the sample size, i.e. 𝑛 (𝑥, 𝑦) coordinates have been sampled.

These estimates are calculated by:

(1)  β̂₁ = 𝑆𝑥𝑦 / 𝑆𝑥𝑥        (2)  β̂₀ = ȳ − β̂₁x̄

where  𝑆𝑥𝑦 = ∑𝑥𝑦 − (∑𝑥)(∑𝑦)/𝑛   and   𝑆𝑥𝑥 = ∑𝑥² − (∑𝑥)²/𝑛

• The estimated gradient/slope β̂₁ is calculated first using (1); then,
using the result of β̂₁, the estimated y-intercept β̂₀ is determined.
• To determine β̂₀, the mean of the observed 𝑦-values (given by ȳ = ∑𝑦/𝑛)
and the mean of the observed 𝑥-values (given by x̄ = ∑𝑥/𝑛) are
substituted into formula (2) above.

Recall: β̂₀ and β̂₁ can be calculated using STAT mode on the
calculator, similar to finding the correlation coefficient, 𝑟.
Using the notation on the calculator, 𝐴 = β̂₀ and 𝐵 = β̂₁.
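A small computational sketch of formulas (1) and (2), assuming Python with numpy (the function name is illustrative and the data below are hypothetical; the calculator's STAT mode gives the same estimates):

```python
# Illustrative sketch: least squares estimates of the intercept and slope
# using the S_xy and S_xx formulas above.
import numpy as np

def least_squares_estimates(x, y):
    n = len(x)
    s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # S_xy
    s_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n         # S_xx
    b1 = s_xy / s_xx                                   # slope estimate, formula (1)
    b0 = np.mean(y) - b1 * np.mean(x)                  # intercept estimate, formula (2)
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2])
b0, b1 = least_squares_estimates(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```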
Hypothesis Testing in Simple Linear Models
• The estimates 𝛽መ0 and 𝛽መ1 depend on the sample that is drawn. If
another sample is obtained, it would lead to different values for 𝛽መ0
and 𝛽መ1 .
• Therefore, in this section, we make inferences about the real/true
values of 𝛽0 and 𝛽1 , thus we perform hypothesis tests for each of them.
For 𝜷₀ (the y-intercept of the line):

Null hypothesis: 𝐻₀: 𝛽₀ = 𝛽₀₀
(𝛽₀₀ is the specific value of the parameter, based on the value in the claim;
the additional ‘0’ in the subscript distinguishes it from the null value for 𝛽₁.)

Alt. hypotheses (the three usual alternatives):
𝐻₁: 𝛽₀ < 𝛽₀₀ (left-tailed)
𝐻₁: 𝛽₀ > 𝛽₀₀ (right-tailed)
𝐻₁: 𝛽₀ ≠ 𝛽₀₀ (two-tailed)
Test Statistic for 𝜷₀ (the degrees of freedom for this t-test is 𝑛 − 2):

T = (β̂₀ − 𝛽₀₀) / [ 𝑆𝑒 √(1/𝑛 + x̄²/𝑆𝑥𝑥) ]  ~ 𝑡(𝑛 − 2)

where:
β̂₀ is the sample estimate (y-intercept of the estimated regression line)
𝛽₀₀ is the specified value under the null hypothesis

𝑆𝑒 = √[ (𝑆𝑦𝑦 − 𝑆𝑥𝑦²/𝑆𝑥𝑥) / (𝑛 − 2) ] = √[ (𝑆𝑦𝑦 − β̂₁𝑆𝑥𝑦) / (𝑛 − 2) ]

𝑆𝑥𝑦 = ∑𝑥𝑦 − (∑𝑥)(∑𝑦)/𝑛     𝑆𝑥𝑥 = ∑𝑥² − (∑𝑥)²/𝑛     𝑆𝑦𝑦 = ∑𝑦² − (∑𝑦)²/𝑛
For 𝜷₁ (the slope/gradient of the line):

Null hypothesis: 𝐻₀: 𝛽₁ = 𝛽₁₀
(𝛽₁₀ is the specific value of the parameter, based on the value in the claim;
the ‘1’ in the subscript shows that the parameter of interest is now 𝛽₁.)

Alt. hypotheses:
𝐻₁: 𝛽₁ < 𝛽₁₀ (left-tailed)
𝐻₁: 𝛽₁ > 𝛽₁₀ (right-tailed)
𝐻₁: 𝛽₁ ≠ 𝛽₁₀ (two-tailed)

Test Statistic for 𝜷₁ (𝑆𝑒 and 𝑆𝑥𝑥 are calculated using the same formulae
as on the previous slide):

T* = (β̂₁ − 𝛽₁₀) √𝑆𝑥𝑥 / 𝑆𝑒  ~ 𝑡(𝑛 − 2)

where:
β̂₁ is the sample estimate (slope/gradient of the estimated regression line)
𝛽₁₀ is the specified value under the null hypothesis
Since both test statistics follow a t-distribution with 𝑛 − 2 degrees of
freedom, the critical values for both tests come from the same
distribution:

For 𝐻₁: 𝛽₀ < 𝛽₀₀ or 𝐻₁: 𝛽₁ < 𝛽₁₀ (left-tailed test):
The area of the rejection region = 𝛼 (in the lower tail).
Critical value: 𝑡ₙ₋₂; 1−𝛼 = −𝑡ₙ₋₂; 𝛼
P-value = 𝑃(𝑇 < 𝑇calc)
Rejection Rule: If the calculated test statistic lies in the rejection region,
reject 𝐻₀ in favour of 𝐻₁. I.e. for 𝐻₁: 𝛽₀ < 𝛽₀₀ reject 𝐻₀ if T < −𝑡ₙ₋₂; 𝛼, or
for 𝐻₁: 𝛽₁ < 𝛽₁₀ reject 𝐻₀ if T* < −𝑡ₙ₋₂; 𝛼.

For 𝐻₁: 𝛽₀ > 𝛽₀₀ or 𝐻₁: 𝛽₁ > 𝛽₁₀ (right-tailed test):
The area of the rejection region = 𝛼 (in the upper tail).
Critical value: 𝑡ₙ₋₂; 𝛼
P-value = 𝑃(𝑇 > 𝑇calc)
N.B. do not forget the degrees of freedom is 𝑛 − 2!
Rejection Rule: If the calculated test statistic lies in the rejection region,
reject 𝐻₀ in favour of 𝐻₁. I.e. for 𝐻₁: 𝛽₀ > 𝛽₀₀ reject 𝐻₀ if T > 𝑡ₙ₋₂; 𝛼, or
for 𝐻₁: 𝛽₁ > 𝛽₁₀ reject 𝐻₀ if T* > 𝑡ₙ₋₂; 𝛼.

For 𝐻₁: 𝛽₀ ≠ 𝛽₀₀ or 𝐻₁: 𝛽₁ ≠ 𝛽₁₀ (two-tailed test):
There are two critical values: the area of the lower rejection region = 𝛼/2
and the area of the upper rejection region = 𝛼/2.
Critical values: −𝑡ₙ₋₂; 𝛼/2 = 𝑡ₙ₋₂; 1−𝛼/2 and 𝑡ₙ₋₂; 𝛼/2
P-value = 2 × 𝑃(𝑇 > |𝑇calc|)
Rejection Rule: For 𝐻₁: 𝛽₀ ≠ 𝛽₀₀ reject 𝐻₀ if T < −𝑡ₙ₋₂; 𝛼/2 or if T > 𝑡ₙ₋₂; 𝛼/2.
For 𝐻₁: 𝛽₁ ≠ 𝛽₁₀ reject 𝐻₀ if T* < −𝑡ₙ₋₂; 𝛼/2 or if T* > 𝑡ₙ₋₂; 𝛼/2.

(1 − 𝛼)100% confidence intervals:

For 𝛽₀:  β̂₀ ± 𝑡ₙ₋₂; 𝛼/2 · 𝑆𝑒 √(1/𝑛 + x̄²/𝑆𝑥𝑥)
For 𝛽₁:  β̂₁ ± 𝑡ₙ₋₂; 𝛼/2 · 𝑆𝑒 / √𝑆𝑥𝑥
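A computational sketch of these tests and intervals, assuming Python with numpy and scipy (the function and argument names are illustrative):

```python
# Illustrative sketch: t statistics and confidence intervals for the intercept
# and slope of a simple linear regression, following the formulas above.
import numpy as np
from scipy import stats

def slr_inference(x, y, beta0_null=0.0, beta1_null=0.0, alpha=0.05):
    n = len(x)
    s_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    s_xx = np.sum(x ** 2) - np.sum(x) ** 2 / n
    s_yy = np.sum(y ** 2) - np.sum(y) ** 2 / n
    b1 = s_xy / s_xx                                    # slope estimate
    b0 = np.mean(y) - b1 * np.mean(x)                   # intercept estimate
    s_e = np.sqrt((s_yy - b1 * s_xy) / (n - 2))         # S_e

    se_b0 = s_e * np.sqrt(1 / n + np.mean(x) ** 2 / s_xx)   # std. error of intercept
    se_b1 = s_e / np.sqrt(s_xx)                              # std. error of slope
    t_b0 = (b0 - beta0_null) / se_b0                         # T  ~ t(n - 2)
    t_b1 = (b1 - beta1_null) / se_b1                         # T* ~ t(n - 2)

    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)            # two-tailed critical value
    ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return t_b0, t_b1, ci_b0, ci_b1
```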

Testing for a significant effect of 𝒙 on 𝒚 (significant association):


• 𝛽1 (the gradient/slope) is the coefficient for the independent variable 𝑥
in the statistical model.
• Therefore, if 𝛽1 = 0, 𝑥 has no effect on the response variable 𝑦 (there is
no association between the variables).
• Thus, we can test 𝐻0 : 𝛽1 = 0 vs 𝐻1 : 𝛽1 ≠ 0 to determine if there is a
significant association between them.
• We can also use the confidence interval to determine this information.
• If the confidence interval for 𝛽1 contains 0, then we can conclude
there is no significant association between 𝑥 and 𝑦 , or 𝑥 has no
significant effect on 𝑦.
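In practice, the test of 𝐻₀: 𝛽₁ = 0 is what most software reports by default. As a sketch (assuming Python with scipy; the data below are hypothetical), scipy.stats.linregress returns the two-sided p-value for exactly this test:

```python
# Illustrative sketch: scipy.stats.linregress reports the two-sided p-value
# for H0: beta_1 = 0, i.e. the test of a significant linear association.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical data
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])

result = stats.linregress(x, y)
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.4f}")
# If result.pvalue < alpha, reject H0 and conclude a significant association.
```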
Example:
The number of copies sold of a new book (measured in thousands of
units) is dependent on the advertising budget the publisher commits in a
pre-publication campaign (measured in thousands of Rands).
The results of 12 recently published books are shown below:

𝑥 8 9.5 7.2 6.5 10 12 11.5 14.8 17.3 27 30 25

𝑦 12.5 18.6 25.3 24.8 35.7 45.4 44.4 45.8 65.3 75.7 72.3 79.2

1) Determine the estimated regression line between the advertising
budget and the number of copies of the book sold.

𝑦 = number of copies sold (dependent variable)
𝑥 = the advertising budget (independent variable)

ŷᵢ = β̂₀ + β̂₁𝑥ᵢ

Using STAT mode on the calculator: β̂₀ = 6.53 and β̂₁ = 2.61
∴ ŷᵢ = 6.53 + 2.61𝑥ᵢ

2) Approximately how many copies of the book would be sold if R20 000
was the advertising budget?

i.e. 𝑥 = 20 (in thousands)
∴ ŷᵢ = 6.53 + 2.61(20) = 58.73

Therefore, approximately 58 730 copies of the book will be sold.
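As a quick software check of parts 1) and 2) (a sketch assuming Python with numpy; the calculator's STAT mode gives the same values):

```python
# Illustrative sketch: fitting the book-sales data and predicting sales
# for an advertising budget of R20 000 (x = 20, in thousands).
import numpy as np

x = np.array([8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25])
y = np.array([12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2])

b1, b0 = np.polyfit(x, y, 1)     # returns slope then intercept (highest power first)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")                 # approx. 6.53 + 2.61 x
print(f"prediction at x = 20: {b0 + b1 * 20:.2f}")      # approx. 58.73 (thousand copies)
```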


3) Test the hypothesis that the true value of the slope of the line (𝛽₁) is 2.
(use 𝛼 = 0.05)

∴ 𝐻₀: 𝛽₁ = 2   (𝛽₁₀ = 2)
   𝐻₁: 𝛽₁ ≠ 2

Test statistic for 𝛽₁:  T* = (β̂₁ − 𝛽₁₀) √𝑆𝑥𝑥 / 𝑆𝑒

Therefore, we need to determine:

𝑆𝑒 = √[ (𝑆𝑦𝑦 − 𝑆𝑥𝑦²/𝑆𝑥𝑥) / (𝑛 − 2) ] = √[ (𝑆𝑦𝑦 − β̂₁𝑆𝑥𝑦) / (𝑛 − 2) ]

and:
𝑆𝑥𝑦 = ∑𝑥𝑦 − (∑𝑥)(∑𝑦)/𝑛     𝑆𝑥𝑥 = ∑𝑥² − (∑𝑥)²/𝑛     𝑆𝑦𝑦 = ∑𝑦² − (∑𝑦)²/𝑛
     𝑥      𝑦      𝑥𝑦        𝑥²        𝑦²
     8      12.5   100       64        156.25
     9.5    18.6   176.7     90.25     345.96
     7.2    25.3   182.16    51.84     640.09
     6.5    24.8   161.2     42.25     615.04
     10     35.7   357       100       1274.49
     12     45.4   544.8     144       2061.16
     11.5   44.4   510.6     132.25    1971.36
     14.8   45.8   677.84    219.04    2097.64
     17.3   65.3   1129.69   299.29    4264.09
     27     75.7   2043.9    729       5730.49
     30     72.3   2169      900       5227.29
     25     79.2   1980      625       6272.64
Sum: 178.8  545    10032.89  3396.92   30656.5

𝑛 = 12,  ∑𝑥 = 178.8,  ∑𝑦 = 545,  ∑𝑥𝑦 = 10032.89,  ∑𝑥² = 3396.92,  ∑𝑦² = 30656.5

∴ 𝑆𝑥𝑦 = ∑𝑥𝑦 − (∑𝑥)(∑𝑦)/𝑛 = 10032.89 − (178.8 × 545)/12 = 1912.39

∴ 𝑆𝑥𝑥 = ∑𝑥² − (∑𝑥)²/𝑛 = 3396.92 − (178.8)²/12 = 732.8

∴ 𝑆𝑦𝑦 = ∑𝑦² − (∑𝑦)²/𝑛 = 30656.5 − (545)²/12 = 5904.42
With 𝑛 = 12, 𝑆𝑥𝑦 = 1912.39, 𝑆𝑥𝑥 = 732.8, 𝑆𝑦𝑦 = 5904.42 and β̂₁ = 2.61
(from ŷᵢ = 6.53 + 2.61𝑥ᵢ):

Since we have already obtained the regression line, use the second
formula for 𝑆𝑒:

𝑆𝑒 = √[ (𝑆𝑦𝑦 − β̂₁𝑆𝑥𝑦) / (𝑛 − 2) ] = √[ (5904.42 − 2.61(1912.39)) / (12 − 2) ] = 9.56

∴ T* = (β̂₁ − 𝛽₁₀) √𝑆𝑥𝑥 / 𝑆𝑒 = (2.61 − 2) √732.8 / 9.56 = 1.73
Since 𝐻₁: 𝛽₁ ≠ 2 (two-tailed) and 𝛼 = 0.05, ∴ 𝛼/2 = 0.025

Critical values: 𝑡ₙ₋₂; 𝛼/2 = 𝑡₁₀; 0.025 = 2.228 and −𝑡₁₀; 0.025 = −2.228

Rejection Rule: Reject the null hypothesis if T* < −2.228 or if T* > 2.228

Decision: Since −2.228 < T* = 1.73 < 2.228, do not reject 𝐻₀ at a 5% l.o.s.

Conclusion: We do not have enough evidence to conclude that the
slope of the line is not 2.
4) Test the hypothesis that the true value of the y-intercept of the line (𝛽₀)
is more than 6. (use 𝛼 = 0.10)

∴ 𝐻₀: 𝛽₀ = 6   (𝛽₀₀ = 6)
   𝐻₁: 𝛽₀ > 6

Test statistic for 𝛽₀:  T = (β̂₀ − 𝛽₀₀) / [ 𝑆𝑒 √(1/𝑛 + x̄²/𝑆𝑥𝑥) ]

We have already obtained ŷᵢ = 6.53 + 2.61𝑥ᵢ, so β̂₀ = 6.53, 𝑆𝑒 = 9.56 and 𝑆𝑥𝑥 = 732.8.

Since ∑𝑥 = 178.8, then x̄ = ∑𝑥/𝑛 = 178.8/12 = 14.9

∴ T = (6.53 − 6) / [ 9.56 √(1/12 + (14.9)²/732.8) ] = 0.089

𝐻₁: 𝛽₀ > 6 (right-tailed), 𝛼 = 0.1

Critical value: 𝑡ₙ₋₂; 𝛼 = 𝑡₁₀; 0.10 = 1.372

Rejection Rule: Reject the null hypothesis if T > 1.372

Decision: Since T < 1.372, do not reject 𝐻₀ at a 10% l.o.s.

Conclusion: We do not have enough evidence to conclude that the
y-intercept of the line is more than 6.
5) Construct a 90% confidence interval for the true value of the
y-intercept of the line, as well as the true value of the gradient.

(1 − 𝛼)100% = 90%, ∴ 1 − 𝛼 = 0.90, 𝛼 = 0.10, ∴ 𝛼/2 = 0.05
𝑡ₙ₋₂; 𝛼/2 = 𝑡₁₀; 0.05 = 1.812

For the y-intercept (𝛽₀):

β̂₀ ± 𝑡ₙ₋₂; 𝛼/2 · 𝑆𝑒 √(1/𝑛 + x̄²/𝑆𝑥𝑥)

with β̂₀ = 6.53, x̄ = 14.9, 𝑆𝑒 = 9.56, 𝑆𝑥𝑥 = 732.8:

∴ 6.53 ± 1.812(9.56) √(1/12 + (14.9)²/732.8) = (−4.24 ; 17.30)

For the gradient/slope (𝛽₁):

β̂₁ ± 𝑡ₙ₋₂; 𝛼/2 · 𝑆𝑒 / √𝑆𝑥𝑥

with β̂₁ = 2.61, 𝑆𝑒 = 9.56, 𝑆𝑥𝑥 = 732.8. The t-multiplier in this confidence
interval is the same as that in the confidence interval for 𝛽₀ above.

∴ 2.61 ± 1.812 (9.56 / √732.8) = (1.97 ; 3.25)

➢ Note: This 90% confidence interval for 𝛽₁ does not contain 0.
Therefore, there is a significant association between the advertising
budget and the number of copies of the book sold.
➢ The same conclusion would have been reached by performing a
test for 𝐻₀: 𝛽₁ = 0 vs 𝐻₁: 𝛽₁ ≠ 0 at a 10% level of significance.
Outline
▪ Section 4.3: Multiple linear regression (basic introduction)
o Introduction and assumptions
o ANOVA test for the overall fit of the model
o Hypothesis tests to assess relationships between the response
variable and predictor variables
o Prediction using the model
o Adjusted R-square

Note: additional content has been added to this section for 2020. Thus, the extra
examples and past papers do not cover this whole section.
Multiple Linear Regression

▪ Multiple linear regression enables you to investigate the relationship


between 𝑌 and several independent variables simultaneously.
▪ Multiple linear regression is a powerful tool for the following tasks:
✓ Prediction – to develop a model to predict future values of a
response variable ( 𝑌 ) based on its relationships with other
predictor variables (𝑋s)
✓ Analytical or Explanatory Analysis – to develop an understanding
of the relationships between the response variable and predictor
variables (is the relationship significant or not?)
Multiple Linear Regression Model
The statistical model is given by
𝑌ᵢ = 𝛽₀ + 𝛽₁𝑥₁ᵢ + 𝛽₂𝑥₂ᵢ + ⋯ + 𝛽ₚ𝑥ₚᵢ + 𝜀ᵢ
Sometimes the 𝑖 in the subscript is dropped for simplicity:
𝑌 = 𝛽₀ + 𝛽₁𝑥₁ + 𝛽₂𝑥₂ + ⋯ + 𝛽ₚ𝑥ₚ + 𝜀
𝑌ᵢ is the response variable, 𝛽₀ is the intercept of the model, 𝛽ⱼ, 𝑗 = 1, …, 𝑝,
are the regression coefficients/parameters for the explanatory/predictor
variables 𝑥ⱼᵢ, and 𝜀ᵢ are the usual random error terms with 𝜀ᵢ ~ 𝑁(0, 𝜎²).
Note the usual assumptions for linear regression hold:
• The error terms are independently and normally distributed with a
mean of 0 and constant variance 𝜎².
• There is a linear relationship between the response and explanatory
variables.
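Parameter estimation for multiple regression is left to software in this course (see the note near the end of this section). As an illustration of what such software does, a least squares fit can be sketched with numpy (an assumption, not the course's prescribed tool; the data are simulated purely for demonstration):

```python
# Illustrative sketch: least squares estimates for a multiple regression model
# Y = b0 + b1*x1 + b2*x2 + error, using numpy on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)                              # hypothetical predictor 1
x2 = rng.uniform(0, 5, n)                               # hypothetical predictor 2
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1, n)     # hypothetical response

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates of (b0, b1, b2)
```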
Picturing the model:
In simple linear regression, you can model the relationship between
the two variables (two dimensions) with a line (one dimension).
In multiple linear regression, the relationship between the variables is
modelled by a plane.

[Figure: a 3D example with 2 explanatory variables, showing the response 𝑌
modelled by a plane over the 𝑋₁ and 𝑋₂ axes.]

Picturing the model:
If there is no relationship among 𝑌 and 𝑋₁ and 𝑋₂, the model is a
horizontal plane passing through the point (𝑌 = 𝛽₀, 𝑋₁ = 0, 𝑋₂ = 0).

[Figure: a horizontal plane at height 𝛽₀ over the 𝑋₁ and 𝑋₂ axes.]
Prediction:
• If the primary interest of fitting a multiple regression model is
prediction, then the terms in the model, the values of their
coefficients, and their statistical significance are of secondary
importance.
• The focus is on producing a model that is the best at predicting
future values of 𝑌 as a function of the 𝑋𝑠. The predicted/estimated
value of 𝑌 is given by this formula:

Ŷᵢ = β̂₀ + β̂₁𝑥₁ᵢ + β̂₂𝑥₂ᵢ + ⋯ + β̂ₚ𝑥ₚᵢ

(Note: the error term falls away in the fitted model)

β̂ⱼ is the estimate of 𝛽ⱼ based on the sample of data being used.
Analytical or Explanatory Analysis:
If the focus is on understanding the relationship/association between
the dependent variable and the independent variables, then the
statistical significance of the coefficients is important as well as the
magnitudes and signs of the coefficients.
Ŷᵢ = β̂₀ + β̂₁𝑥₁ᵢ + β̂₂𝑥₂ᵢ + ⋯ + β̂ₚ𝑥ₚᵢ

Recall from section 4.2: Statistical inference can be performed on the
regression coefficients (𝛽ⱼ), i.e. hypothesis tests and confidence intervals
concerning their true values. Similarly, we can do the same for each
parameter in the model above (each 𝛽ⱼ).
The statistical significance of the coefficients, 𝛽ⱼ, (based on 𝐻₀: 𝛽ⱼ = 0 vs
𝐻₁: 𝛽ⱼ ≠ 0) as well as the magnitudes, signs and confidence intervals of
the coefficients are used to describe the relationship between each 𝑥ⱼ
and 𝑌 and to determine the significance of their association.
Overall fit of the model
• A hypothesis test for the overall fit of the estimated regression model
can be performed to determine how well it is fitting the data.
• This estimated multiple regression model is compared to the baseline
model, where the baseline model is simply the intercept model:

Ŷᵢ = β̂₀

Null hypothesis:
𝐻₀: The regression model does not fit the data better than the baseline
model, or
𝐻₀: 𝛽₁ = ⋯ = 𝛽ₚ = 0
(Note: this only refers to the coefficients of the predictor variables (the 𝑥ⱼ’s)
all being equal to 0, not the intercept term 𝛽₀.)
Alternative hypothesis:
𝐻₁: The regression model does fit the data better than the baseline
model, or
𝐻₁: Not all 𝛽ⱼ’s = 0, i.e. at least one 𝛽ⱼ does not equal 0.
Overall fit of the model

If the estimated linear regression model does not fit the data better than
the baseline model, you fail to reject the null hypothesis.
✓ Thus, you do not have enough evidence to say that any of the
regression coefficients differ from zero. The predictor variables do
not explain a significant amount of variability in the response
variable, thus they do not have a significant effect on the
response.
If the estimated linear regression model does fit the data better than the
baseline model, you reject the null hypothesis.
✓ Thus, you do have enough evidence to say that at least one of the
regression parameters differs from zero. At least one predictor variable
explains a significant amount of variability in the response variable,
thus at least one has a significant effect on the response.
ANOVA for Multiple Linear Regression
• An Analysis of Variance (ANOVA) can be used to determine if the
estimated multiple regression model fits the data well or not, i.e. to test
the following hypotheses:

𝐻0 : 𝛽1 = ⋯ = 𝛽𝑝 = 0 vs 𝐻1 : At least one 𝛽𝑗 does not equal 0.


• Similar to ANOVA for experimental designs, the variation in the
response (𝑌) gets partitioned into its various sources: the variation
explained by the explanatory variables (𝑥𝑗 ’s) and residual variation
not explained by the explanatory variables.
• Variation in the response refers to the sum of the squared differences
between the values of 𝑌 and the mean value of 𝑌, expressed
mathematically as ∑(𝑌ᵢ − Ȳ)² (this is the total sum of squares).
ANOVA for Multiple Linear Regression

𝑆𝑆𝑇𝑜𝑡𝑎𝑙 = ∑(𝑌ᵢ − Ȳ)² = ∑𝑌ᵢ² − 𝑛Ȳ²   (sums over 𝑖 = 1, …, 𝑛)

Note: 𝑉𝑎𝑟(𝑌) = 𝑆𝑆𝑇𝑜𝑡𝑎𝑙/(𝑛 − 1) is the total variation in 𝑌.
(The term 𝑛Ȳ² is the same as 𝐶𝑀 = 𝐺²/𝑛 in ANOVA for experimental designs.)
This total sum of squares can be partitioned into:

𝑆𝑆𝑇𝑜𝑡𝑎𝑙 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸


Where
• 𝑆𝑆𝑅 is the sum of squares due to the fitted regression line (variation
explained by the fitted model) and
• 𝑆𝑆𝐸 is the residual variation (the variation in the response that is not
being explained by the fitted model)
ANOVA for Multiple Linear Regression
𝑆𝑆𝑇𝑜𝑡𝑎𝑙 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸

∑(𝑌ᵢ − Ȳ)² = ∑(Ŷᵢ − Ȳ)² + ∑(𝑌ᵢ − Ŷᵢ)²   (sums over 𝑖 = 1, …, 𝑛)

Degrees of freedom:  𝑛 − 1 = 𝑝 + (𝑛 − 1 − 𝑝)

• 𝑌ᵢ = the values of the raw responses
• Ȳ = the average of the responses
• Ŷᵢ = the estimated/predicted value of a response (based on the fitted
regression model)
• 𝑛 = the sample size
• 𝑝 = the number of predictor variables in the fitted regression line (this
is equal to the number of regression coefficients 𝛽ⱼ, 𝑗 = 1, …, 𝑝)
ANOVA for Multiple Linear Regression

Source of Variation | Sum of Squares | 𝑑𝑓        | Mean sum of squares    | F-statistic
Regression/model    | 𝑆𝑆𝑅            | 𝑝         | 𝑀𝑆𝑅 = 𝑆𝑆𝑅/𝑝            | 𝐹calc = 𝑀𝑆𝑅/𝑀𝑆𝐸
Error               | 𝑆𝑆𝐸            | 𝑛 − 1 − 𝑝 | 𝑀𝑆𝐸 = 𝑆𝑆𝐸/(𝑛 − 1 − 𝑝)  |
Total               | 𝑆𝑆𝑇𝑜𝑡𝑎𝑙        | 𝑛 − 1     |                        |

Note: 𝑆𝑆𝑇𝑜𝑡𝑎𝑙 = (𝑛 − 1) × 𝑉𝑎𝑟(𝑌)

𝐻₀: 𝛽₁ = ⋯ = 𝛽ₚ = 0 is rejected if 𝐹calc > 𝐹ₚ, ₙ₋₁₋ₚ; 𝛼

If 𝐻₀ is rejected in favour of 𝐻₁: at least one 𝛽ⱼ does not equal 0, then
we can conclude that the fitted regression model fits the data fairly
well (better than the baseline/intercept model).
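A small sketch of this overall F-test given the sums of squares, assuming Python with scipy (the function name and arguments are illustrative):

```python
# Illustrative sketch: the overall F-test for a fitted multiple regression model.
from scipy import stats

def overall_f_test(ss_regression, ss_error, n, p, alpha=0.05):
    msr = ss_regression / p                  # mean square for the model
    mse = ss_error / (n - 1 - p)             # mean square error
    f_calc = msr / mse
    f_crit = stats.f.ppf(1 - alpha, dfn=p, dfd=n - 1 - p)   # critical value
    reject = f_calc > f_crit                 # reject H0: beta_1 = ... = beta_p = 0
    return f_calc, f_crit, reject
```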
Assessing the relationships between the response
and predictor variables
• If 𝐻1 : At least one 𝛽𝑗 does not equal 0 is concluded from the ANOVA
test, then we can perform a hypothesis test on each individual
regression coefficient in the model to determine if its corresponding
variable has a significant effect on the response (i.e. if the variable
has a significant relationship/association with the response).
• This is done by performing the following hypothesis test:

𝐻0 : 𝛽𝑗 = 0 vs 𝐻1 : 𝛽𝑗 ≠ 0

• If the null hypothesis is NOT rejected, then it suggests that 𝛽𝑗 = 0. Thus,


the term 𝛽𝑗 𝑥𝑗 in the regression model falls away, which means the
variable 𝑥𝑗 makes no contribution to the response, 𝑌 (i.e. it has no
significant effect on 𝑌 or association with 𝑌).
Assessing the relationships between the response
and predictor variables
𝑌ᵢ = 𝛽₀ + 𝛽₁𝑥₁ᵢ + 𝛽₂𝑥₂ᵢ + ⋯ + 𝛽ₚ𝑥ₚᵢ + 𝜀ᵢ

E.g. if 𝛽₁ = 0, then 𝛽₁𝑥₁ᵢ falls away in the above model.

NOTE: Even though the hypothesis test is performed on the regression
coefficient (𝛽ⱼ), the result informs us about the significance of the
relationship of 𝑥ⱼ with 𝑌 (NOT 𝛽ⱼ and 𝑌).
If 𝛽₁ ≠ 0, then 𝛽₁𝑥₁ᵢ is retained in the model, thus informing us that 𝑥₁ has
a significant effect on/contribution to 𝑌 (therefore 𝑥₁ and 𝑌 have a
significant relationship).
The estimate for 𝛽ⱼ (which is given by β̂ⱼ) represents how much the
response 𝑌 increases (if β̂ⱼ > 0) or decreases (if β̂ⱼ < 0) for a ONE unit
increase in 𝑥ⱼ.
Adjusted R-Square

• Recall: If prediction is the main interest in fitting a multiple regression


model, then it is of interest to find the ‘best’ fitting model, which is the
fitted model that explains the most variability in the response.
• Recall from simple linear regression that 𝑟 2 (also known as R-square) is
the coefficient of determination (the square of the correlation
coefficient, 𝑟) and represents the proportion of the variability in the
response variable (𝑦) that is explained by its linear relationship with the
explanatory variable (𝑥).
• This R-square value can also be produced for a fitted multiple
regression model, and used to choose the ‘best’ fitting model where
the higher R-square, the better the model.
Adjusted R-Square

• However, R-square always increases or stays the same as you include


more terms/variables in the model.
• Therefore, choosing the ‘best’ model is not as simple as just making
the R-square as large as possible.
• For this reason, the adjusted R-square is another measure similar to R-
square, but it takes into account the number of terms/variables in the
model.
• It can be thought of as a penalized version of R-square with the
penalty increasing with each parameter added to the model.
Adjusted R-Square
• The adjusted R-square for a fitted model with an intercept term is
given by:

𝑅²adj = 1 − [(𝑛 − 1)(1 − 𝑟²)] / [𝑛 − (𝑝 + 1)] = 1 − [𝑆𝑆𝐸/(𝑛 − 1 − 𝑝)] / [𝑆𝑆𝑇𝑜𝑡𝑎𝑙/(𝑛 − 1)]

Where:
𝑛 = the sample size/number of observations used to fit the model,
𝑝 = the number of regression coefficients/predictor variables in the
regression model (i.e. excluding the intercept term),
𝑟² = the unadjusted R-square value (the square of the correlation
coefficient, 𝑟):

𝑟² = 𝑆𝑆𝑅 / 𝑆𝑆𝑇𝑜𝑡𝑎𝑙   (coefficient of determination)

Note: We can use 𝑟² (R-square) to find 𝑆𝑆𝑅 if we are given both 𝑟² and
𝑆𝑆𝑇𝑜𝑡𝑎𝑙 or 𝑉𝑎𝑟(𝑌).
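A small sketch of this calculation, assuming Python (the check uses the values 𝑟² = 0.85, 𝑛 = 250 and 𝑝 = 3 from the example that follows):

```python
# Illustrative sketch: adjusted R-square from the unadjusted r^2,
# the sample size n and the number of predictor variables p.
def adjusted_r_square(r2, n, p):
    return 1 - (n - 1) * (1 - r2) / (n - (p + 1))

# r^2 = 0.85 with n = 250 observations and p = 3 predictors
print(adjusted_r_square(0.85, 250, 3))   # approx. 0.848
```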
Note: How to obtain the estimate of each parameter in the multiple
regression model will not be covered in STAT140. However, statistical
software can be used to produce these estimates and obtain measures
such as 𝑟 2 , after which we can perform an ANOVA for the overall fit of
the fitted model.

Example:
Suppose we wish to assess the effect of BMI, age and waist
circumference on the systolic blood pressure of a person using a sample
of 250 observations.
The following results were obtained for the coefficients of each predictor
variable:

Variable        Estimate   P-value
Intercept        68.15     0.0001
BMI              -1.23     0.0001
Age               0.65     0.0024
Waist Circum.    -0.58     0.1381
a) Determine the estimated regression line.

Variable             Estimate   P-value
Intercept (β̂₀)        68.15     0.0001
BMI (β̂₁)              -1.23     0.0001
Age (β̂₂)               0.65     0.0024
Waist Circum. (β̂₃)    -0.58     0.1381

(Note that the estimated regression line does not include an error term.)

Ŷᵢ = 68.15 − 1.23𝑥₁ᵢ + 0.65𝑥₂ᵢ − 0.58𝑥₃ᵢ

Where
Ŷᵢ is the estimated systolic blood pressure for the 𝑖th person, 𝑖 = 1, …, 250
(𝑛 = 250, the sample size),
𝑥₁ᵢ corresponds to the value of the predictor variable BMI,
𝑥₂ᵢ corresponds to the value of the predictor variable Age,
𝑥₃ᵢ corresponds to the value of the predictor variable Waist circumference.

b) It was determined that the variance of the observed systolic
blood pressure readings was 25, and the coefficient of
determination for the fitted model in part a) was 0.85. Perform
an ANOVA test to determine if the overall fitted model performs
better than the baseline model. Use a 5% level of significance.

We are given the following information: 𝑛 = 250,
𝑉𝑎𝑟(𝑌) = 25 = 𝑆𝑆𝑇𝑜𝑡𝑎𝑙/(𝑛 − 1) and 𝑟² = 0.85 = 𝑆𝑆𝑅/𝑆𝑆𝑇𝑜𝑡𝑎𝑙

∴ 𝑆𝑆𝑇𝑜𝑡𝑎𝑙 = 25(250 − 1) = 6225
∴ 𝑆𝑆𝑅 = 0.85(6225) = 5291.25
∴ 𝑆𝑆𝐸 = 𝑆𝑆𝑇𝑜𝑡𝑎𝑙 − 𝑆𝑆𝑅 = 6225 − 5291.25 = 933.75

Source of Variation | Sum of Squares | 𝑑𝑓  | Mean sum of squares | F-statistic
Regression/model    | 5291.25        | 3   | 1763.75             | 464.14
Error               | 933.75         | 246 | 3.80                |
Total               | 6225           | 249 |                     |

𝑛 = 250, 𝑝 = 3 (3 predictor variables are in the fitted model)

𝐻₀: 𝛽₁ = ⋯ = 𝛽₃ = 0 (the model is not a better fit) vs
𝐻₁: At least one 𝛽ⱼ does not equal 0 (the model is a better fit)

Critical value = 𝐹₃, ₂₄₆; 0.05 = 2.60 (use 𝑑𝑓₂ = ∞ in the F-tables)

∴ Since 𝐹calc = 464.14 > 2.60, 𝐻₀ is rejected at a 5% l.o.s. Thus, the fitted
model is better than the intercept-only model.
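A quick software check of these calculations (a sketch assuming Python with scipy):

```python
# Illustrative sketch: reproducing the ANOVA quantities in part b).
from scipy import stats

n, p = 250, 3
var_y, r2 = 25, 0.85
ss_total = var_y * (n - 1)             # 6225
ss_regression = r2 * ss_total          # 5291.25
ss_error = ss_total - ss_regression    # 933.75

f_calc = (ss_regression / p) / (ss_error / (n - 1 - p))
f_crit = stats.f.ppf(0.95, dfn=p, dfd=n - 1 - p)
print(f"F = {f_calc:.2f}, critical value = {f_crit:.2f}")
# f_calc is approx. 464.7 (464.14 in the table above, which rounds MSE to 3.80);
# it exceeds the critical value, so H0 is rejected.
```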
c) Which of the predictor variables has a significant effect on a
person’s systolic blood pressure at a 5% level of significance?

Variable        Estimate   P-value    Hypothesis test
Intercept        68.15     0.0001     𝐻₀: 𝛽₀ = 0 vs 𝐻₁: 𝛽₀ ≠ 0
BMI              -1.23     0.0001     𝐻₀: 𝛽₁ = 0 vs 𝐻₁: 𝛽₁ ≠ 0
Age               0.65     0.0024     𝐻₀: 𝛽₂ = 0 vs 𝐻₁: 𝛽₂ ≠ 0
Waist Circum.    -0.58     0.1381     𝐻₀: 𝛽₃ = 0 vs 𝐻₁: 𝛽₃ ≠ 0

• In multiple linear regression, similar to simple linear regression, we can


perform hypothesis tests concerning each parameter in the statistical
model (i.e. the intercept 𝛽0 and the coefficients 𝛽𝑗 ).
• By default, statistical software produces test statistics and P-values for
𝐻₀: 𝛽ⱼ = 0 vs 𝐻₁: 𝛽ⱼ ≠ 0, which are used to test for the significance of the
corresponding variable.
Variable        Estimate   P-value
Intercept        68.15     0.0001
BMI              -1.23     0.0001
Age               0.65     0.0024
Waist Circum.    -0.58     0.1381

• The P-values for BMI and age are significant (both are less than 5%).
Therefore, the null hypothesis concerning their corresponding
coefficient equaling 0 is rejected. Thus, these variables have a
significant effect on the response variable.
• The P-value for waist circumference = 0.1381 > 0.05, therefore the
coefficient corresponding to this variable is not significantly different
from 0 (as 𝐻0 : 𝛽3 = 0 is not rejected). Thus, this variable does not have
a significant effect on the systolic blood pressure of an individual.
d) Interpret the results of the estimates for BMI and age.
Variable Estimate P-value
Intercept 68.15 0.0001
BMI -1.23 0.0001
Age 0.65 0.0024
Waist Circum. -0.58 0.1381
BMI:
For a one unit increase in BMI, there is a 1.23 unit decrease in systolic
blood pressure. Thus, BMI has a decreasing effect on systolic blood
pressure (as its estimate is negative).
Age:
For a one unit increase in age, there is a 0.65 unit increase in systolic
blood pressure. Thus, age has an increasing effect on systolic blood
pressure (as its estimate is positive).
Some notes on multiple linear regression:
• You can also explore possible interaction effects between two or
more of the explanatory/predictor variables.
• You can include the effects of qualitative/categorical variables in
addition to that of the continuous explanatory variables in the same
model. This model is known as an ANCOVA (analysis of covariance)
model.
