Chapter 8 Regression Model - 2023

1) Regression models show the relationship between a dependent variable (Y) and one or more independent variables (X) through a fitted equation known as the mathematical model.
2) The goal of regression is to fit a straight line to observed data points using the method of least squares, which minimizes the sum of the squared vertical distances between the observed values (Y) and the fitted values (Ŷ).
3) The intercept and slope of the regression line help explain how the dependent variable changes in relation to the independent variable. The intercept indicates where the line crosses the y-axis, and the slope shows the rate of change of Y with respect to X.


11/2/2023

Chapter 7. Regression Models

Nguyen VP Nguyen, Ph.D.


Department of Industrial & Systems Engineering, HCMUT
Email: nguyennvp@hcmut.edu.vn

Overview
• Linear association implies a straight-line relationship
• Regression is used for a purpose: the regression of Y on X
• The difference between regression and hypothesis testing:
 Hypothesis testing is used to test parameters of one or two populations based on a sample,
 whereas regression is used to identify a relationship between two or more variables.


Regression Models
• A regression model is a simplified or ideal representation of the real world.
 Latin "re-" ("back") plus "-gredior, -gredi, -gressus sum" ("go"); the "-ion" suffix is common for forming nouns. Thus "regression" literally means "going back".
• A regression model shows how to fit a straight line to pairs of observations on the two variables using the method of least squares.
• All scientific inquiry is based to some extent on models - that is, the set of simplifying assumptions - on which regression is based.

Scatterplots
• Linear means that Y is proportional to X; that a straight line can be drawn to describe their relationship to one another.
• Which of these plots seem to show a linear relationship?

[Figure: four scatterplots over X values 0-16 - (a) Y vs X, (b) Y4 vs X4, (c) Y3 vs X3, (d) Y2 vs X2.]


Regression Models
Mathematical Model (fitted equation):
 Ŷt = b0 + b1t  or  Ŷt = b0 + b1Xt
Statistical Model:
 Yt = β0 + β1t + ε  or  Yt = β0 + β1Xt + ε
(The "hats" indicate estimated numbers.)
where
Y: dependent variable (DV)
t or X: independent variable (IV)
b0: intercept of the fitted line
b1: slope of the line

Sum of squares of error

• One measure of the error in our model is the sum of the squares of the errors:

 SSE = Σ(Yt − Ŷt)²

• The best fit line for Yt (or a series of Yt) is obtained by minimizing the sum of squared vertical distances from the data points to the line.
 This is called the Least Squares Criterion or Least Squares Method:

 Minimize SSE = Σ(Yt − Ŷt)² = Σ(Yt − b0 − b1Xt)²  with respect to b0, b1
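As a concrete illustration, the least squares estimates can be computed directly from the closed-form solution of the normal equations. This is a sketch (the function and variable names are our own, not from the chapter):

```python
# Least Squares Method: fit Y_hat = b0 + b1*X by minimizing
# SSE = sum((Y - Y_hat)^2). Closed-form normal-equation solution.
def fit_line(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Perfectly linear data y = 1 + 2x recovers b0 = 1, b1 = 2 exactly.
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # -> 1.0 2.0
```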


The Intercept and Slope

• The intercept (or "constant term") indicates where the regression line intercepts the vertical axis. Some people call this a "shift parameter" because it "shifts" the regression line up or down on the graph.
• The slope indicates how Y changes as X changes (e.g., if the slope is positive, as X increases, Y also increases; if the slope is negative, as X increases, Y decreases).

Example 2 of Mr. Bump’s data


Residual Plots for Selling Level (1000 gallons) ~ Y

[Figure: Minitab four-in-one residual plot - Normal Probability Plot, Residuals Versus Fits, Histogram of Residuals, Residuals Versus Order.]

Decomposition of Variance
Observation = Fit + Residuals

 Y − Ȳ = (Ŷ − Ȳ) + (Y − Ŷ)

(The fitted line minimizes the sum of squared vertical distances from the data points to the line; Y − Ȳ is the total variation.)

 R² = Explained Variation in Y / Total Variation in Y


Decomposition of Variance
Observation = Fit + Residuals

 Y − Ȳ = (Y − Ŷ) + (Ŷ − Ȳ)

 Σ(Y − Ȳ)² = Σ[(Y − Ŷ) + (Ŷ − Ȳ)]²

 Σ(Y − Ȳ)² = Σ(Ŷ − Ȳ)² + Σ(Y − Ŷ)²

 SST = SSR + SSE

Sum of Squares Total:    SST = Σ(Y − Ȳ)²  df = n − 1
Sum of Squares of Regression: SSR = Σ(Ŷ − Ȳ)²  df = 1
Sum of Squares of Errors:   SSE = Σ(Y − Ŷ)²  df = n − 2
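The identity SST = SSR + SSE can be checked numerically. A sketch in the same notation (toy data and names are our own; the identity holds at the least-squares fit):

```python
def decompose(x, y, b0, b1):
    """Split total variation in Y into regression and error sums of squares."""
    n = len(y)
    y_bar = sum(y) / n
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # df = n - 1
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # df = 1
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # df = n - 2
    return sst, ssr, sse

# Toy data whose least-squares fit is Y_hat = 0 + 1.9*X.
x, y = [1, 2, 3, 4], [2, 4, 5, 8]
sst, ssr, sse = decompose(x, y, 0.0, 1.9)
print(sst, ssr, sse)  # SST = SSR + SSE: 18.75 = 18.05 + 0.70 (up to rounding)
print(ssr / sst)      # R^2 = explained variation / total variation
```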

ANalysis Of Variance or ANOVA table


Standard errors of the estimate

• Used to measure the variability of the data points about the fitted line Ŷ, measured in the Y direction.
• Mean square error vs. standard error of the estimate:
 MSE = SSE / (n − 2)
 se = √MSE
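The distinction between the two quantities can be sketched in a few lines (the formulas are the standard ones; the numbers are made up):

```python
import math

def standard_error_of_estimate(sse, n):
    mse = sse / (n - 2)    # mean square error: SSE divided by its df, n - 2
    return math.sqrt(mse)  # standard error of the estimate, in units of Y

# e.g. SSE = 0.70 from n = 4 points -> se = sqrt(0.35)
print(standard_error_of_estimate(0.70, 4))
```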

Coefficient of Determination R²

• R² = 1: all of the variability in Y is explained when X is known; the sample data points all lie on the fitted regression line.
• R² = 0: none of the variability in Y is explained by X.


Coefficient of Determination

Example 6 of Mr. Bump’s data


The PRESS statistic

• PRESS stands for "Prediction Sum of Squares."
 It is a cross-validation statistic used in statistical modeling, particularly in linear regression.
 The PRESS statistic is a measure of how well a regression model performs in predicting new data points.
• A lower PRESS value indicates a model with better predictive ability.
 It is useful for detecting whether a model is overfitting the data.
 Overfitting occurs when a model is too complex and starts to capture the random noise in the data, rather than the underlying relationship.
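A brute-force sketch of PRESS for simple regression: hold out each point in turn, refit the line on the rest, and accumulate the squared prediction errors. (This is our own illustrative implementation; statistical packages typically compute PRESS from leverages without refitting.)

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def press(x, y):
    """Prediction Sum of Squares via leave-one-out cross-validation."""
    total = 0.0
    for i in range(len(x)):
        # Refit the line without observation i ...
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # ... and score how well it predicts the held-out point.
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

# Perfectly linear data predicts every held-out point exactly: PRESS = 0.
print(press([1, 2, 3, 4, 5], [3, 5, 7, 9, 11]))  # -> 0.0 (up to rounding)
```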

Four Quick Checks (Simple Regression)

1. Does the model make sense (i.e., check the slope term)?
2. Is there a statistically significant relationship between the dependent and independent variables (t-test)?
3. What percentage of the variation in the dependent variable does the regression model explain (R-Square)?
4. Do the residuals violate any assumptions (Analysis of Residuals)?


First Quick Check

• Does the model make sense?
• Does the slope term match our expectation based on the time series plot?
• Is the correct model specified (no variables omitted)?
• Is the model form appropriate (e.g., linear)?

Second Quick Check

Are the coefficients statistically significant?
• Use a t-test to examine the null hypothesis that the slope of the true relationship between X and Y is equal to zero.
• The hypothesis test is: H0: b1 = 0, H1: b1 ≠ 0
 If −t(α/2, n−2) ≤ t ≤ t(α/2, n−2): we accept (or fail to reject) H0.
  - Then the regression is not significant at the 5% level of significance.
  - This says we believe (or don't have enough evidence to say otherwise) that the slope of the line is zero.
  - If the slope of the line is zero, then X tells us nothing about Y.
 If t < −t(α/2, n−2) or t > t(α/2, n−2): reject H0; the regression is significant at the 5% level of significance.
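The t statistic for the slope can be computed as follows. This is a sketch with made-up data (the critical value must still come from a t table with n − 2 degrees of freedom):

```python
import math

def slope_t_stat(x, y):
    """t = b1 / s(b1) for testing H0: b1 = 0 in simple regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2))  # standard error of the estimate
    s_b1 = se / math.sqrt(sxx)     # standard error of the slope b1
    return b1 / s_b1

t = slope_t_stat([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
# Compare |t| with t(0.025, n-2) = t(0.025, 3) = 3.182 for a 5% two-sided test.
print(abs(t) > 3.182)  # -> True: reject H0, the regression is significant
```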


Second Quick Check

Are the coefficients statistically significant?
Example 8:
• t = −4.8
• t(0.025, 8) = 2.306 (check Table 3, page 485/510)
• We have t = −4.8 < −t(0.025, 8) = −2.306 → Reject H0


Example 6.2


An alternative test on H0: F-statistic

F-statistic = ratio of (regression mean square) to (error mean square); it has an F-distribution with df = (1, n−2).

 F = MSR / MSE


An alternative test on H0: F-statistic

• If H0: b1 = 0 is true: in the F formula, MSR is not larger than MSE, so F is small.
• If H0 is false: in the F formula, MSR is much larger than MSE, so F is large.

Example 9

• F = MSR/MSE = 23.4
• Look for F(1, n−2) in Table 5, p. 483; we have F(1, n−2) = F(1, 8) = 5.32
• F of 23.4 > F(1, 8) of 5.32 → Reject H0


Alternative test of H0
• F ≤ F(1, n−2) → Fail to reject H0 → the regression is not significant
• F > F(1, n−2) → Reject H0 → the regression is significant
Note: F = t²
Example:
• F = MSR/MSE = 23.4; F of 23.4 > F(1, 8) of 5.32 → Reject H0 at the 5% level → the regression is significant
• Check F = t²: 23.4 ≈ (−4.8)²
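The F = t² identity is easy to verify numerically. A sketch on made-up data (names are our own):

```python
import math

def f_and_t(x, y):
    """Return (F, t) for simple regression; F should equal t**2."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    msr, mse = ssr / 1, sse / (n - 2)  # df = 1 and df = n - 2
    t = b1 / (math.sqrt(mse) / math.sqrt(sxx))
    return msr / mse, t

f, t = f_and_t([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(abs(f - t ** 2) < 1e-9)  # -> True: F = t^2 holds
```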


Third Quick Check

• Large sample size (n >> 100), even with small R² (< 10%): we could reject H0; a linear relation might exist; the regression is significant at the 5% level of significance.
• Small n and large R² (> 80%): the regression is significant at the 5% level of significance.
• Very small n (n < 10), even with large R² (> 80%): we need more sample evidence before concluding "the regression is significant at the 5% level of significance."

Fourth Quick Check

1. The errors are normally distributed
2. The underlying relation (Y vs X) is linear
3. The errors have constant variance
4. The errors are independent


Check the assumptions for the model

1 & 2: Histogram of residuals and Normal probability plot
 If the points fall close to the straight line, the data are well fit by a straight line and the errors have a normal distribution.
 Bell-shaped curves are expected, but with small data sets we could see almost any shape of histogram!

Histogram of residuals

• The histogram of the residuals shows the distribution of the residuals for all observations. Because its appearance depends on the number of intervals used to group the data, don't rely on a histogram alone to assess the normality of the residuals.
• A histogram is most effective when you have approximately 20 or more data points. If the sample is too small, each bar on the histogram does not contain enough data points.

Pattern → What the pattern may indicate:
• A long tail in one direction → Skewness
• A bar that is far away from the other bars → An outlier


Residual Plots for Sales

[Figure: Minitab four-in-one residual plot - Normal Probability Plot, Residuals Versus Fits, Histogram of Residuals, Residuals Versus Order.]

Residual Plots for Y11

[Figure: Minitab four-in-one residual plot for 100 observations - Normal Probability Plot, Residuals Versus Fits, Histogram of Residuals, Residuals Versus Order.]


Check the assumptions for the model

3. The errors have constant variance
 Residuals vs fitted values
 - Curved relationships → transform the data in order to standardize the variance.
 - Residual spread increasing with the magnitude of the fitted values → non-constant variance → take the log of Y (regress log Y on X) in order to hold the variance constant.
 - See page 197 (202/510) of the textbook.

Summary: assumptions of regression model

• Several assumptions are needed to fit the regression model using the method described.
 Errors are uncorrelated random variables with constant variance:
 - zero mean
 - homoscedastic (constant variance)
 - mutually independent (non-autocorrelated)
 If we test hypotheses or create confidence intervals, we also need the errors to be normally distributed.
 We are assuming that the linear model is correct; that Y does not vary with any higher (or lower) power of X.
• These assumptions need to be checked!


Summary: assumptions of regression model

• To check these assumptions, use the following methods:
 Save residuals when running a regression (we will check for autocorrelation).
 Scatterplot of the data (see if the linear model is correct).
 Scatterplot of the residuals (a.k.a. "residual analysis"; checks the correlation of residuals, non-constant variance, normality, and linearity assumptions - use the Minitab "four-in-one" plot).

Pattern → What the pattern may indicate:
• Fanning or uneven spreading of residuals across fitted values → Non-constant variance
• Curvilinear → A missing higher-order term
• A point that is far away from zero → An outlier
• A point that is far away from the other points in the x-direction → An influential point


• If there are too many outliers, the model may not be acceptable. You should try to identify the cause of any outlier.
• The variance of the residuals increases with the fitted values: notice that, as the value of the fits increases, the scatter among the residuals widens. This pattern indicates that the variances of the residuals are unequal (non-constant).

• If you identify any patterns or outliers in your residuals-versus-fits plot, consider the following solutions:

Issue → Possible solution:
• Non-constant variance → Consider using Fit Regression Model with a Box-Cox transformation or weights.
• An outlier or influential point →
 1. Verify that the observation is not a measurement error or data-entry error. Consider removing data values that are associated with abnormal, one-time events (special causes).
 2. Then, repeat the analysis without this observation to determine how it impacts your results.


Check the assumptions for the model

4. The errors are independent
 Residuals vs order of the data
 - Check error independence by observing the graph and calculating the autocorrelations rk (for k = 1, ..., n/4).
 - Check that there are no systematic patterns and that the ACF values of the residuals are uniformly small.
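The lag-k residual autocorrelation rk mentioned above can be sketched as follows (the formula is the standard sample ACF; the residual values are made up):

```python
def residual_autocorr(e, k):
    """Sample autocorrelation r_k of the residuals at lag k."""
    n = len(e)
    e_bar = sum(e) / n
    num = sum((e[t] - e_bar) * (e[t - k] - e_bar) for t in range(k, n))
    den = sum((et - e_bar) ** 2 for et in e)
    return num / den

# Strongly alternating residuals show a large negative lag-1 autocorrelation,
# a systematic pattern that violates the independence assumption.
e = [1, -1, 1, -1, 1, -1, 1, -1]
print(residual_autocorr(e, 1))  # -> -0.875
```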

Assignment
