Chapter 7. Regression Models
Nguyen VP Nguyen, Ph.D.
Department of Industrial & Systems Engineering, HCMUT
Email: nguyennvp@hcmut.edu.vn
Overview
• Linear association implies a straight-line relationship
• Regression is used for a purpose: the regression of Y on X (predicting Y from X)
• The difference between regression and hypothesis testing:
Whereas hypothesis testing is used to test parameters of one or two populations based on a sample,
regression is used to identify a relationship between two or more variables.
Regression Models
• A regression model is a simplified or ideal
representation of the real world.
Latin "re-" ("back") plus "-gredior, -gredi, -gressus sum" ("go");
the "-ion" suffix is common for forming nouns.
Thus "regression" literally means "going back".
• A regression model shows how to fit a straight
line to pairs of observations on the two
variables using the method of least squares.
• All scientific inquiry is based to some extent on
models - that is, on simplifying assumptions -
and regression, too, rests on such a set of assumptions.
Scatterplots
• Linear means that Y changes in proportion to X, so that a straight line can be drawn to
describe their relationship to one another.
• Which of these plots seem to show a linear relationship?
[Four scatterplots labeled a-d: Y vs X, Y2 vs X2, Y3 vs X3, and Y4 vs X4, each with the horizontal axis running from 0 to 16.]
Regression Models
Mathematical Model (fitted equation): $\hat{Y}_t = b_0 + b_1 t$ or $\hat{Y}_t = b_0 + b_1 X_t$
Statistical Model: $Y_t = \beta_0 + \beta_1 t + \varepsilon_t$ or $Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t$
The "hats" indicate estimated numbers.
where
Y: dependent variable (DV)
t or X: independent variable (IV)
b0: intercept of the fitted line
b1: slope of the line
Sum of squares of error
• One measure of the error in our model is the sum of the squares of the errors:
$\text{SSE} = \sum_t (Y_t - \hat{Y}_t)^2$
• The best-fit line for the series $Y_t$ is obtained by minimizing the sum of squared vertical
distances from the data points to the line.
This is called the Least Squares Criterion or Least Squares Method.
$\min_{b_0,\, b_1} \text{SSE} = \sum_t (Y_t - \hat{Y}_t)^2 = \sum_t (Y_t - b_0 - b_1 X_t)^2$
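As a quick numerical illustration of the criterion, here is a minimal Python sketch (the data values are made up for illustration and are not Mr. Bump's data) that computes b0 and b1 from the usual closed-form least squares solution and evaluates SSE:

```python
# Minimal sketch of the least squares criterion (illustrative data only).
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)               # independent variable
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2])    # dependent variable

# Closed-form least squares estimates: b1 = S_xy / S_xx, b0 = Ybar - b1 * Xbar
x_bar, y_bar = X.mean(), Y.mean()
b1 = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

Y_hat = b0 + b1 * X                 # fitted values
SSE = np.sum((Y - Y_hat) ** 2)      # sum of squared errors being minimized

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {SSE:.3f}")
```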
The Intercept and Slope
• The intercept (or "constant term") indicates where the regression line intercepts the vertical axis.
Some people call this a "shift parameter" because it "shifts" the regression line up or down on the graph.
• The slope indicates how Y changes as X changes (e.g., if the slope is positive, as X increases, Y also
increases; if the slope is negative, as X increases, Y decreases).
Example 2 of Mr. Bump’s data
Residual Plots for Selling Level (1000 gallons) ~ Y
[Four-in-one residual plots: Normal Probability Plot, Residuals Versus Fits, Histogram of Residuals, Residuals Versus Order.]
Decomposition of Variance
Observation = Fit + Residual
$Y = \hat{Y} + (Y - \hat{Y})$
(The fitted line minimizes the sum of squared vertical distances from the data points to the line; the total variation in Y is then split into explained and unexplained parts.)
$R^2 = \dfrac{\text{Explained Variation in } Y}{\text{Total Variation in } Y}$
Decomposition of Variance
Observation = Fit + Residual
$Y = \hat{Y} + (Y - \hat{Y})$, so $Y - \bar{Y} = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$
$\sum (Y - \bar{Y})^2 = \sum \left[(\hat{Y} - \bar{Y}) + (Y - \hat{Y})\right]^2$
$\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + \sum (Y - \hat{Y})^2$
SST = SSR + SSE
Total Sum of Squares: $\text{SST} = \sum (Y - \bar{Y})^2$, df = n - 1
Sum of Squares of Regression: $\text{SSR} = \sum (\hat{Y} - \bar{Y})^2$, df = 1
Sum of Squares of Errors: $\text{SSE} = \sum (Y - \hat{Y})^2$, df = n - 2
ANalysis Of Variance or ANOVA table
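A small Python sketch of the decomposition and the resulting ANOVA layout, continuing the same made-up illustrative data as before (not the textbook's example); it verifies SST = SSR + SSE numerically:

```python
# Sketch: variance decomposition and a simple ANOVA table for a least squares fit
# (illustrative data; the layout follows the standard simple-regression ANOVA form).
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2])
n = len(Y)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X

SST = np.sum((Y - Y.mean()) ** 2)       # total sum of squares, df = n - 1
SSR = np.sum((Y_hat - Y.mean()) ** 2)   # regression sum of squares, df = 1
SSE = np.sum((Y - Y_hat) ** 2)          # error sum of squares, df = n - 2

assert np.isclose(SST, SSR + SSE)       # SST = SSR + SSE

MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE
print("Source      SS         df   MS         F")
print(f"Regression  {SSR:9.3f}  1    {MSR:9.3f}  {F:.2f}")
print(f"Error       {SSE:9.3f}  {n-2}    {MSE:9.3f}")
print(f"Total       {SST:9.3f}  {n-1}")
```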
Standard errors of the estimate
• Used to measure the variability (scatter) of the data points about the fitted line $\hat{Y}$, measured in the Y direction
• Mean square error vs standard error of the
estimate
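The usual definitions behind this comparison (standard formulas; the notation $s_{y \cdot x}$ for the standard error of the estimate is an assumption here, matching common textbooks):

$$\text{MSE} = \frac{\text{SSE}}{n-2}, \qquad s_{y \cdot x} = \sqrt{\text{MSE}} = \sqrt{\frac{\sum_t (Y_t - \hat{Y}_t)^2}{n-2}}$$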
Coefficient of Determination R²
• R² = 1: all of the variability in Y is explained when X is known; the sample data points all lie on the fitted regression line
• R² = 0: none of the variability in Y is explained by X
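In terms of the sums of squares from the decomposition above (standard formula, consistent with the SST/SSR/SSE notation of the slides):

$$R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$$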
Coefficient of Determination
Example 6 of Mr. Bump’s data
The PRESS statistic
• PRESS stands for "Prediction Sum of Squares."
It is a cross-validation measure used in statistical modeling,
particularly in linear regression.
The PRESS statistic is a measure of how well a
regression model performs in predicting new data
points.
• A lower PRESS value indicates a model with
better predictive ability.
It is useful for detecting whether a model is overfitting
the data.
Overfitting occurs when a model is too complex and
starts to capture the random noise in the data, rather
than the underlying relationship.
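One common way to compute PRESS for a linear regression is the leave-one-out shortcut PRESS = Σ (eᵢ / (1 − hᵢᵢ))², where hᵢᵢ are the hat-matrix diagonals. A Python sketch with made-up data (not the textbook's example):

```python
# Sketch: PRESS for a simple linear regression via the leave-one-out shortcut
# PRESS = sum_i (e_i / (1 - h_ii))^2, where h_ii are the hat-matrix diagonals.
# Illustrative data only.
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2])
n = len(Y)

Xmat = np.column_stack([np.ones(n), X])           # design matrix [1, X]
beta, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)   # least squares fit
resid = Y - Xmat @ beta                           # ordinary residuals e_i
H = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T  # hat matrix
h = np.diag(H)                                    # leverages h_ii

press = np.sum((resid / (1 - h)) ** 2)
sse = np.sum(resid ** 2)
print(f"SSE = {sse:.3f}, PRESS = {press:.3f}")
# PRESS >= SSE always; a PRESS much larger than SSE can signal overfitting.
```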
Four Quick Checks (Simple Regression)
1. Does the model make sense (i.e., check slope
term)?
2. Is there a statistically significant relationship
between the dependent and independent
variables (t-test)?
3. What percentage of the variation in the
dependent variable does the regression model
explain (R-Square)?
4. Do the residuals violate assumptions (Analysis
of Residuals)?
First Quick Check
• Does the model make sense?
• Does the slope term agree with our expectations
based on the time series plot?
• Correct model specified (no variables omitted)
• Appropriate model form (e.g., linear)
Second Quick Check
Are the coefficients statistically significant?
• Use a t-test to examine the null hypothesis that
the slope of the true relationship between X and Y
is equal to zero.
• The hypothesis test is: H0: b1 = 0 versus H1: b1 ≠ 0.
If $-t_{\alpha/2,\, n-2} \le t \le t_{\alpha/2,\, n-2}$: we accept (or fail to reject) H0.
– Then the regression is not significant at the 5% level of significance.
– It says we believe (or don't have enough evidence to say otherwise) that the slope of the line is zero.
– If the slope of the line is zero, then X tells us nothing about Y.
If $t > t_{\alpha/2,\, n-2}$ or $t < -t_{\alpha/2,\, n-2}$: reject H0; the regression is significant at the 5% level of significance.
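The test statistic in its standard form (the notation $s_{b_1}$ for the standard error of the slope is assumed here):

$$t = \frac{b_1 - 0}{s_{b_1}}, \qquad s_{b_1} = \frac{s_{y \cdot x}}{\sqrt{\sum_t (X_t - \bar{X})^2}}, \qquad \text{df} = n-2$$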
Second Quick Check
Are the coefficients statistically significant?
Example 8:
• $t = -4.8$; compare with $t_{\alpha/2,\, n-2}$
• $t_{\alpha/2,\, n-2} = t_{0.025,\, 8} = 2.306$ (Check Table 3, page 485/510)
• We have $t = -4.8 < -t_{0.025,\, 8} = -2.306$: Reject H0
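A quick check of this comparison with scipy (only t = −4.8 and df = 8 are taken from the slide; everything else is computed):

```python
# Sketch: reproducing the Example 8 comparison with scipy.
from scipy import stats

t_stat = -4.8                       # computed t statistic quoted on the slide
df = 8                              # n - 2 = 10 - 2
t_crit = stats.t.ppf(0.975, df)     # two-sided 5% critical value, about 2.306

reject = abs(t_stat) > t_crit
print(f"t critical = {t_crit:.3f}, reject H0: {reject}")  # True -> slope is significant
```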
Example 6.2
An alternative test on H0: F-statistic
F-statistic = ratio of the regression mean square (MSR) to the
error mean square (MSE); under H0 it has an F-distribution with
df = (1, n-2)
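Written out with the ANOVA quantities above (standard form):

$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/1}{\text{SSE}/(n-2)} \sim F_{1,\,n-2} \quad \text{under } H_0: b_1 = 0$$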
An alternative test on H0: F-statistic
If H0: b1 = 0 is true: in the F formula, MSR is not larger than MSE (MSR << MSE), so the F ratio is small.
If H0 is false: in the F formula, MSR is much larger than MSE (MSR >> MSE), so the F ratio is large.
Example 9
• F = MSR/MSE = 23.4
• Looking up F(1, n-2) in Table 5, p. 483, we have F(1, n-2) = F(1, 8) = 5.32
• Is F = 23.4 > F(1, 8) = 5.32?
Alternative test of H0
• F <= F(1, n-2) → Fail to reject H0 → Regression is not significant
• F > F(1, n-2) → Reject H0 → Regression is significant
Note: F = t²
Example:
• F = MSR/MSE = 23.4 and F(1, 8) = 5.32, so F = 23.4 > F(1, 8) = 5.32
→ Reject H0 at the 5% level; the regression is significant
• Check F = t²: 23.4 ≈ (-4.8)²
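The same check in Python (the values 23.4 and −4.8 are taken from the slides; the critical value comes from scipy rather than the printed table):

```python
# Sketch: checking the F-test numbers quoted on the slide with scipy.
from scipy import stats

F = 23.4                              # F = MSR/MSE from the slide
f_crit = stats.f.ppf(0.95, 1, 8)      # F(1, n-2) at the 5% level, about 5.32

print(f"F critical = {f_crit:.2f}, reject H0: {F > f_crit}")
print(f"t^2 = {(-4.8) ** 2:.2f}")     # (-4.8)^2 = 23.04, close to 23.4 (slide rounds)
```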
Third Quick Check
• Large sample size (n >> 100), even with small R² (< 10%): we could reject H0; a linear relation might exist; the regression is significant at the 5% level of significance.
• Small sample size n and large R² (> 80%): the regression is significant at the 5% level of significance.
• Very small sample size (n < 10) and large R² (> 80%): we need more sample evidence before concluding that "the regression is significant at the 5% level of significance".
Fourth Quick Check
1. The errors are normally distributed
2. The underlying relation (Y vs X) is linear
3. The errors have constant variance
4. The errors are independent
Check the assumptions for the model
1 & 2: Histogram of residuals and Normal probability plot
If the points fall close to the straight line on the normal probability plot, the data are well fitted by a straight line and the errors have (approximately) normal distributions.
A bell-shaped histogram is expected, but with small data sets the histogram of residuals could take almost any shape!
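Beyond eyeballing the plots, a numerical normality check can help with small samples; a Python sketch with placeholder residuals (not data from the slides):

```python
# Sketch: checking residual normality numerically (complements the plots);
# the residuals below are illustrative placeholders.
import numpy as np
from scipy import stats

residuals = np.array([1.2, -0.8, 0.3, -1.5, 0.9, 0.4, -0.2, 1.1, -1.0, -0.4])

# Normal probability (Q-Q) correlation and a Shapiro-Wilk test.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
w_stat, p_value = stats.shapiro(residuals)

print(f"Q-Q plot correlation r = {r:.3f}")       # close to 1 suggests normality
print(f"Shapiro-Wilk p-value = {p_value:.3f}")   # small p-value suggests non-normality
```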
Histogram of residuals
• The histogram of the residuals shows the distribution of the residuals for all observations.
Because its appearance depends on the number of intervals used to group the data, don't use the histogram to assess the normality of the residuals.
• A histogram is most effective when you have approximately 20 or more data points. If the sample is too small, then each bar on the histogram does not contain enough data points to reliably show the shape of the distribution.
Pattern                                        What the pattern may indicate
A long tail in one direction                   Skewness
A bar that is far away from the other bars     An outlier
Residual Plots for Sales
[Four-in-one residual plots: Normal Probability Plot, Residuals Versus Fits, Histogram of Residuals, Residuals Versus Order.]
Residual Plots for Y11
[Four-in-one residual plots for a larger data set (about 100 observations): Normal Probability Plot, Residuals Versus Fits, Histogram of Residuals, Residuals Versus Order.]
Check the assumptions for the model
3. The errors have constant variance
Residuals vs fitted values
— Curved relationships → need to transform the data in order to stabilize the variance
— Residual spread increasing with the magnitude of the fitted values → not constant variance → apply a log transformation to Y before regressing on X in order to hold the variance constant
— See page 197 (202/510) of the textbook.
Summary: assumptions of regression model
• Several assumptions are needed to fit the
regression model using the method described.
Errors are uncorrelated random variables with
constant variance
– zero mean
– homoscedastic (constant variance)
– mutually independent (non-autocorrelated)
If we test hypotheses or create confidence intervals,
we also need the errors to be normally distributed.
We are assuming that the linear model is correct; that
y does not vary with any higher (or lower) power of x.
• These assumptions need to be checked!
Summary: assumptions of regression model
• To check these assumptions, use the following
methods:
Save residuals when running a regression (we will
check for autocorrelation).
Scatterplot of data (see if linear model is correct)
Scatterplot of residuals (a.k.a. “residual analysis,”
checks correlation of residuals, non-constant variance,
normality, and linearity assumptions – use Minitab
“four-in-one” plot)
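If Minitab is not available, an approximation of the "four-in-one" plot can be produced with matplotlib; the `fitted` and `residuals` arrays below are placeholders standing in for the output of a fitted regression:

```python
# Sketch: an approximation of Minitab's "four-in-one" residual plot using matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
fitted = np.linspace(5, 15, 30)            # placeholder fitted values
residuals = rng.normal(0, 1.5, size=30)    # placeholder residuals (in observation order)

fig, ax = plt.subplots(2, 2, figsize=(9, 7))

stats.probplot(residuals, dist="norm", plot=ax[0, 0])   # normal probability plot
ax[0, 0].set_title("Normal Probability Plot")

ax[0, 1].scatter(fitted, residuals)                      # residuals vs fitted values
ax[0, 1].axhline(0, color="gray")
ax[0, 1].set_title("Versus Fits")

ax[1, 0].hist(residuals, bins=8)                         # histogram of residuals
ax[1, 0].set_title("Histogram")

ax[1, 1].plot(np.arange(1, len(residuals) + 1), residuals, marker="o")  # vs order
ax[1, 1].axhline(0, color="gray")
ax[1, 1].set_title("Versus Order")

plt.tight_layout()
plt.show()
```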
Pattern                                                              What the pattern may indicate
Fanning or uneven spreading of residuals across fitted values        Nonconstant variance
Curvilinear                                                          A missing higher-order term
A point that is far away from zero                                   An outlier
A point that is far away from the other points in the x-direction    An influential point
If there are too many outliers, the model may not be acceptable. You should try to identify the cause of any outlier.
The variance of the residuals increases with the fitted values. Notice that, as the value of the fits increases, the scatter among the residuals widens. This pattern indicates that the variances of the residuals are unequal (nonconstant).
• If you identify any patterns or outliers in your residual
versus fits plot, consider the following solutions:
Issue: Nonconstant variance
Possible solution: Consider using Fit Regression Model with a Box-Cox transformation or weights.

Issue: An outlier or influential point
Possible solutions:
1. Verify that the observation is not a measurement error or data-entry error. Consider removing data values that are associated with abnormal, one-time events (special causes).
2. Then, repeat the analysis without this observation to determine how it impacts your results.
Check the assumptions for the model
4. The errors are independent
Plot residuals vs. the order of the data:
— Check error independence by examining the plot and calculating the residual autocorrelations r_k (for lags k up to about n/4)
— Check that there are no systematic patterns and that the ACF of the residuals is uniformly small.
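A Python sketch of the numerical side of this check, computing the residual autocorrelations r_k for lags up to about n/4 and comparing them with the rough ±2/√n white-noise band (placeholder residuals):

```python
# Sketch: checking residual independence via autocorrelations r_k, k = 1, ..., n/4.
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, size=20)      # placeholder residuals, in time order
n = len(residuals)

def autocorr(e, k):
    """Lag-k autocorrelation r_k of the residual series."""
    e = e - e.mean()
    return np.sum(e[k:] * e[:-k]) / np.sum(e ** 2)

for k in range(1, n // 4 + 1):
    r_k = autocorr(residuals, k)
    # A rough 95% band for white-noise residuals is about +/- 2/sqrt(n).
    flag = "large" if abs(r_k) > 2 / np.sqrt(n) else "small"
    print(f"r_{k} = {r_k:+.3f} ({flag})")
```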
Assignment