
Week 10 & 11

Simple Linear Regression


Learning Objectives

❑ Introduce the basics of regression analysis

❑ Determine and interpret the value of the:
  ▪ Slope coefficient of the independent variable
  ▪ Coefficient of determination
  ▪ Standard error of estimate

❑ Construct confidence intervals and carry out hypothesis tests involving the slope of the regression line
Introduction: Correlation vs. Regression

❑ Correlation analysis is used to measure the strength of the linear association between two variables
  ▪ Correlation can only show the strength of the linear association between two variables
  ▪ No causal effect is implied by correlation analysis, i.e. it does not tell us whether X affects Y or Y affects X
  ▪ Interpretation of the correlation coefficient is purely statistical
  ▪ Correlation was first presented in Chapter 1 and again in Chapter 8

❑ A scatter plot (or scatter diagram) can be used to show the association between two variables.
Introduction: Correlation vs. Regression

❑ By contrast, regression analysis can be used to:
  ▪ Explain the impact of changes in an independent variable (X) on the dependent variable (Y), e.g. determine how X affects Y
  ▪ Predict the value of a dependent variable (Y) based on the value of at least one independent variable (X)

Dependent variable, Y:
  ▪ the variable we wish to explain
  ▪ e.g. consumption

Independent variable, X:
  ▪ the variable used to explain the dependent variable
  ▪ e.g. income
Simple Linear Regression Model

❑ The simplest form of regression model
❑ Only ONE (1) independent variable, X
❑ The relationship between X and Y is described by a linear function
❑ Changes in Y are assumed to be caused by changes in X
❑ We always assume that X (the independent variable) affects Y (the dependent variable), and not the other way around (i.e. not "Y causes X")
Simple Linear Regression Model

The population regression model:

  Yᵢ = β₀ + β₁Xᵢ + εᵢ

where
  Yᵢ = dependent variable (population)
  β₀ = population Y-intercept
  β₁ = population slope coefficient
  Xᵢ = independent variable
  εᵢ = random error term

β₀ + β₁Xᵢ is the linear component; εᵢ is the random error component.
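As a rough illustration, the population model can be simulated in Python. The parameter values, sample size and income range below are assumptions chosen only to mimic the food-expenditure example used later; they are not the lecture's data.

```python
# Illustrative simulation of the population model Y_i = beta0 + beta1*X_i + eps_i.
# All numbers here are assumed, toy values (NOT the lecture data).
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = -700.0, 0.45            # assumed population intercept and slope
n = 30                                 # assumed sample size

X = rng.uniform(2000, 6000, size=n)    # hypothetical monthly incomes (RM)
eps = rng.normal(0, 65, size=n)        # random error: mean zero, constant variance
Y = beta0 + beta1 * X + eps            # observed monthly food expenditure (RM)
```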
Simple Linear Regression Model: Graphical Illustration

[Figure: the population regression line Yᵢ = β₀ + β₁Xᵢ + εᵢ, with intercept β₀ and population slope β₁ = ΔY/ΔX]
Estimated Simple Linear Regression Model

The simple linear regression equation provides an estimate of the population regression line:

  Ŷᵢ = β̂₀ + β̂₁Xᵢ

  Ŷᵢ  = estimated (predicted) value of Yᵢ
  β̂₀ = estimate of the regression intercept, E(β̂₀) = β₀
  β̂₁ = estimate of the regression slope, E(β̂₁) = β₁

Note that the individual random error terms εᵢ do not appear in the estimated equation because they have a mean of zero, E(εᵢ) = 0.
Estimated Simple Linear Regression Model: Interpretation of β̂₀ and β̂₁

β̂₀ is the estimated average value of Y when the value of X is zero.

β̂₁ is the estimated change in the average value of Y as a result of a one-unit change in X.
Estimated Simple Linear Regression Model (continued)

[Figure: scatter of Y against X with the fitted line Ŷᵢ = β̂₀ + β̂₁Xᵢ. For a given Xᵢ, the observed value of Y differs from the predicted value of Y by the random error εᵢ; the line has intercept β₀ and slope β₁]
Types of Relationships between X and Y

[Figure: example scatter plots of linear relationships (positive and negative, ✓ the case covered by simple linear regression) and curvilinear relationships between X and Y]
Types of Relationships between X and Y

[Figure: example scatter plots of strong relationships (points close to a line) and weak relationships (points widely scattered around a line)]
Types of Relationships between X and Y

[Figure: example scatter plots showing no relationship between X and Y]
Estimated Simple Linear Regression Model

How to find β̂₀ and β̂₁?

→ Use the Ordinary Least Squares (OLS) method
→ OLS finds the estimates β̂₀ and β̂₁ that minimize the sum of the squared differences between Y and Ŷ:

  min Σ(Yᵢ − Ŷᵢ)² = min Σ(Yᵢ − (β̂₀ + β̂₁Xᵢ))² = min Σε̂ᵢ²

→ The coefficients β̂₀ and β̂₁ can be found using statistical software (a short sketch follows)
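For readers who want to see what the software computes, here is a minimal sketch of the standard closed-form OLS solution for the simple linear regression case, assuming X and Y are NumPy arrays of equal length (for example the simulated data above, or the lecture data exported from Excel).

```python
# Minimal sketch of the closed-form OLS estimates for simple linear regression.
import numpy as np

def ols_fit(X, Y):
    """Return (beta0_hat, beta1_hat) minimizing the sum of squared residuals."""
    x_bar, y_bar = X.mean(), Y.mean()
    beta1_hat = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

beta0_hat, beta1_hat = ols_fit(X, Y)
```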
Residual Analysis

▪ Residual analysis helps you determine whether the selected regression model is appropriate by assessing its assumptions visually.
▪ The residual for observation i is the difference between the observed and predicted value of Y:

  ε̂ᵢ = Yᵢ − Ŷᵢ

▪ Check the assumptions of regression by examining the residuals:
  ▪ Examine the linearity assumption
  ▪ Examine for constant variance at all levels of X (homoscedasticity)
  ▪ Evaluate the normal distribution assumption
  ▪ Evaluate the independence assumption
▪ Use graphical analysis of residuals to check each of these assumptions, e.g. by plotting residuals vs. X (a sketch of such plots follows)
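A minimal sketch of such residual plots, assuming X, Y, beta0_hat and beta1_hat are available from the OLS sketch above.

```python
# Sketch of the graphical residual checks described above.
import matplotlib.pyplot as plt

Y_hat = beta0_hat + beta1_hat * X        # predicted values
residuals = Y - Y_hat                    # residuals: eps_hat_i = Y_i - Y_hat_i

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X, residuals)            # look for curvature / changing spread
axes[0].axhline(0, linestyle="--")
axes[0].set(xlabel="X", ylabel="Residual", title="Residuals vs X")
axes[1].hist(residuals, bins=10)         # rough check of the normality assumption
axes[1].set(xlabel="Residual", title="Histogram of residuals")
plt.tight_layout()
plt.show()
```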
Assumptions of the Simple Linear Regression Model

After we obtain the estimated regression equation, and before we analyze the results, we must make sure that:
❑ Normality of error
  ▪ The error is normally distributed for any given value of X
❑ Zero mean value of error
  ▪ The average value of the error terms is zero
❑ Homoscedasticity
  ▪ The probability distribution of the errors has constant variance
❑ Independence of errors
  ▪ The error values are statistically independent
❑ Independence of error and independent variable
  ▪ There is no relationship between the error and the independent variable (X)
Normality of Error

[Figure: for each value of X, the errors (and hence the Y values) are normally distributed around the regression line]
Error is Normally Distributed with Mean Equal to Zero and Constant Variance (Homoscedasticity)

[Figure: residual plots against X — non-constant variance (the spread of the residuals changes with X) vs. ✓ constant variance (the spread of the residuals stays the same across X)]
Independence of Errors

[Figure: residuals plotted against observation order, used to check whether the error values are independent]
Independence of Error and X Variable

[Figure: residual plots against X — not independent (residuals show a pattern related to X) vs. ✓ independent (no pattern in the residuals across X)]
Simple Linear Regression Analysis: An Example

A researcher wishes to examine the relationship between income per month (RM) and food expenditure per month (RM). A random sample of 30 respondents is selected.

Dependent variable (Y): monthly food expenditure (RM)
Independent variable (X): monthly income (RM)

The data are provided in the accompanying Microsoft Excel file.
Run Regression Model Using Excel
Data / Data Analysis / Regression

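An equivalent fit can also be obtained outside Excel. Below is a sketch using Python's statsmodels, assuming the lecture data have been exported to a CSV file; the file name and column names ("food_expenditure.csv", "Income", "FoodExpenditure") are hypothetical.

```python
# Rough Python equivalent of Excel's Data / Data Analysis / Regression.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("food_expenditure.csv")        # hypothetical file name
X = sm.add_constant(df["Income"])               # add the intercept column
model = sm.OLS(df["FoodExpenditure"], X).fit()
print(model.summary())                          # coefficients, R-square, ANOVA, CIs
```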
Simple Linear Regression Analysis: Excel Output

The regression equation is:

  Food Expenditure = −723.4509 + 0.4627(Income)

Regression Statistics
  Multiple R          0.8531
  R Square            0.7278
  Adjusted R Square   0.7181
  Standard Error      65.0494
  Observations        30

ANOVA
              df   SS            MS         F          Significance F
  Regression   1   316755.4866   316755.5   74.85783   2.12377E-09
  Residual    28   118479.9801   4231.428
  Total       29   435235.4667

                        Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
  Intercept             -723.4509      120.7403         -5.9918   1.87E-06   -970.7762   -476.1256
  Monthly Income (RM)    0.4627          0.0535          8.6520   2.12E-09      0.3531      0.5722
Simple Linear Regression Equation: Interpretation of the Intercept, β̂₀

  Food Expenditure = −723.4509 + 0.4627(Income)

β̂₀ is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)

▪ So β̂₀ = −723.4509 indicates that for a respondent who has no income (X = 0), the estimated total food expenditure is −RM723.4509 per month, OR
▪ −RM723.4509 is the portion of the food expenditure that is not explained by monthly income.

Sometimes the value of β̂₀ has no practical meaning; we include it only to complete the linear regression model.
Simple Linear Regression Equation: Estimated Slope Coefficient, β̂₁

  Food Expenditure = −723.4509 + 0.4627(Income)

β̂₁ measures the estimated change in the average value of Y as a result of a one-unit change in X

▪ Here, β̂₁ = 0.4627 implies that average food expenditure tends to increase by RM0.4627 for each additional RM1 increase in monthly income.
Simple Linear Regression Equation: Prediction of Y

Predict the food expenditure for a respondent with an income of RM2000 per month:

  Food Expenditure = −723.4509 + 0.4627(Income)
                   = −723.4509 + 0.4627(2000)
                   = 201.9491

The predicted food expenditure for a respondent with an income of RM2000 per month is RM201.95.
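The same prediction as a quick arithmetic check in Python, using the estimated coefficients from the Excel output.

```python
# Quick check of the prediction above using the fitted equation.
beta0_hat, beta1_hat = -723.4509, 0.4627
income = 2000
predicted = beta0_hat + beta1_hat * income
print(round(predicted, 2))    # 201.95
```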
Measures of Variation for the Estimated Simple Linear Regression Model

Total variation is made up of two parts:

  SST = SSR + SSE
  (Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)

  SST = Σ(Yᵢ − Ȳ)²
  SSR = Σ(Ŷᵢ − Ȳ)²
  SSE = Σ(Yᵢ − Ŷᵢ)²

where:
  Ȳ  = average value of the dependent variable
  Yᵢ = observed values of the dependent variable
  Ŷᵢ = predicted value of Y for the given Xᵢ value
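A minimal sketch of these sums of squares in Python, assuming Y and Y_hat are NumPy arrays of observed and predicted values (see the earlier sketches).

```python
# The three sums of squares and the identity SST = SSR + SSE.
import numpy as np

SST = np.sum((Y - Y.mean()) ** 2)        # total sum of squares
SSR = np.sum((Y_hat - Y.mean()) ** 2)    # regression (explained) sum of squares
SSE = np.sum((Y - Y_hat) ** 2)           # error (unexplained) sum of squares
assert np.isclose(SST, SSR + SSE)        # identity: SST = SSR + SSE
```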
Measures of Variation for the Estimated Simple Linear Regression Model

SST: Total Sum of Squares
❑ Measures the variation of the Yᵢ values around their mean Ȳ.

SSR: Regression Sum of Squares
❑ Explained variation attributable to the relationship between X and Y.

SSE: Error Sum of Squares
❑ Variation attributable to factors other than the relationship between X and Y.
Measures of Variation for the Estimated Simple Linear Regression Model

Excel Output:

ANOVA
              df   SS
  Regression   1   316755.4866   (SSR)
  Residual    28   118479.9801   (SSE)
  Total       29   435235.4667   (SST)
Coefficient of Determination, r²

❑ The coefficient of determination is the portion of the total variation in the dependent variable (Y) that is explained by variation in the independent variable (X).
❑ The coefficient of determination is also called r-squared and is denoted r².

  r² = SSR / SST = regression sum of squares / total sum of squares

  0 ≤ r² ≤ 1
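As a quick check, r² can be reproduced by plugging in the ANOVA sums of squares from the Excel output shown earlier.

```python
# r-squared reproduced from the ANOVA sums of squares in the Excel output.
SSR, SST = 316755.4866, 435235.4667
r_squared = SSR / SST
print(round(r_squared, 4))    # 0.7278
```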
Examples of Approximate r² Values

r² = 1

[Figure: scatter plots where all points lie exactly on the regression line]

Perfect linear relationship between X and Y: 100% of the variation in Y is explained by variation in X.
Examples of Approximate r² Values

0 < r² < 1

[Figure: scatter plots where the points are spread around the regression line]

Weaker linear relationships between X and Y: some, but not all, of the variation in Y is explained by variation in X.
Examples of Approximate r² Values

r² = 0

[Figure: scatter plot with a horizontal regression line]

No linear relationship between X and Y: the value of Y does not depend on X (none of the variation in Y is explained by variation in X).
Coefficient of Determination, r²: Excel Output

  r² = SSR / SST = 316755.4866 / 435235.4667 = 0.7278

Regression Statistics
  Multiple R          0.8531
  R Square            0.7278
  Adjusted R Square   0.7181
  Standard Error      65.0494
  Observations        30

ANOVA
              df   SS            MS         F          Significance F
  Regression   1   316755.4866   316755.5   74.85783   2.12377E-09
  Residual    28   118479.9801   4231.428
  Total       29   435235.4667

                        Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
  Intercept             -723.4509      120.7403         -5.9918   1.87E-06   -970.7762   -476.1256
  Monthly Income (RM)    0.4627          0.0535          8.6520   2.12E-09      0.3531      0.5722
Coefficient of Determination, r²: Interpretation

  r² = 0.7278

72.78% of the variation in monthly food expenditure (Y) is explained by the variation in monthly income (X).

OR

The estimated model can explain 72.78% of the variation in monthly food expenditure.
Standard Error of Estimate

• The standard error of estimate (the standard deviation of the observations around the regression line) is estimated by

  S_YX = √( SSE / (n − k − 1) ) = √MSE

where
  SSE = error sum of squares
  n   = sample size
  k   = number of independent variables (X)
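A quick check of this formula against the example, using the SSE from the Excel ANOVA table.

```python
# Standard error of estimate reproduced from the Excel ANOVA values.
import math

SSE, n, k = 118479.9801, 30, 1          # error SS, sample size, one X variable
S_YX = math.sqrt(SSE / (n - k - 1))     # equivalently sqrt(MSE)
print(round(S_YX, 4))                   # 65.0494, the "Standard Error" in Excel
```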
Standard Error of Estimate: Excel Output

  S_YX = 65.0494  (the "Standard Error" entry in the Regression Statistics)

Regression Statistics
  Multiple R          0.8531
  R Square            0.7278
  Adjusted R Square   0.7181
  Standard Error      65.0494
  Observations        30

ANOVA
              df   SS            MS         F          Significance F
  Regression   1   316755.4866   316755.5   74.85783   2.12377E-09
  Residual    28   118479.9801   4231.428
  Total       29   435235.4667

                        Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
  Intercept             -723.4509      120.7403         -5.9918   1.87E-06   -970.7762   -476.1256
  Monthly Income (RM)    0.4627          0.0535          8.6520   2.12E-09      0.3531      0.5722
Comparing Standard Errors of Estimate

S_YX is a measure of the variation of the observed Y values around the regression line.

[Figure: two scatter plots — small S_YX (points close to the regression line) vs. large S_YX (points widely scattered around the line)]
Inference about the Slope: Individual t-Test

Is there a linear relationship between X and Y?
❑ t-test for the population slope
❑ Null and alternative hypotheses:
  ▪ H0: β₁ = 0 (no linear relationship)
  ▪ H1: β₁ ≠ 0 (a linear relationship does exist)
❑ Test statistic:

  t = (β̂₁ − β₁) / S_β̂₁ ,   d.f. = n − k − 1

where:
  β̂₁   = regression slope coefficient
  β₁   = hypothesized slope
  S_β̂₁ = standard error of the slope
t Test for Significance of the Independent Variable: Example

Is there evidence of a linear relationship between monthly income and monthly food expenditure at the 0.05 level of significance?

Step 1: Hypotheses
  H0: β₁ = 0
  H1: β₁ ≠ 0

Step 2: Significance level
  α = 0.05

Step 3: Decision rule
  Reject H0 if the test statistic is greater than the upper critical value or less than the lower critical value. Otherwise do not reject H0.
t Test for Significance of the Independent Variable: Example

Step 4: Critical values
  ±t_(α/2, n−k−1) = ±t_(0.025, 28) = ±2.048

From the Excel output:

                        Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
  Intercept             -723.4509      120.7403         -5.9918   1.87E-06   -970.7762   -476.1256
  Monthly Income (RM)    0.4627          0.0535          8.6520   2.12E-09      0.3531      0.5722

Step 5: Test statistic
  t = (β̂₁ − β₁) / SE_β̂₁ = (0.4627 − 0) / 0.0535 = 8.652
t Test for Significance of the Independent Variable: Example

Step 6: Decision
  d.f. = 30 − 2 = 28
  Reject H0 since t = 8.652 > 2.048

  [Figure: t distribution with rejection regions of α/2 = 0.025 in each tail beyond ±2.048; the test statistic 8.6520 falls in the upper rejection region]

Step 7: Conclusion
  We have sufficient evidence to conclude that monthly income significantly affects monthly food expenditure at the 0.05 significance level.
t Test for Significance of the Independent Variable: Example

Additional step: p-value = 2.12E-09

                        Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
  Intercept             -723.4509      120.7403         -5.9918   1.87E-06   -970.7762   -476.1256
  Monthly Income (RM)    0.4627          0.0535          8.6520   2.12E-09      0.3531      0.5722

Interpretation of the p-value:
  The probability of obtaining a test statistic at least as extreme as 8.6520 when H0 is true is 2.12 × 10⁻⁹.

Decision:
  Since the p-value (2.12 × 10⁻⁹) < α (0.05), we reject H0.
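Both the t statistic and the p-value can be reproduced from the reported slope and its standard error; a short sketch using scipy follows. Small differences from Excel's 8.6520 and 2.12E-09 arise because the displayed coefficients are rounded.

```python
# t test for the slope reproduced from the reported coefficient and its SE.
from scipy import stats

beta1_hat, se_beta1 = 0.4627, 0.0535
n, k, alpha = 30, 1, 0.05
df = n - k - 1                                      # 28 degrees of freedom

t_stat = (beta1_hat - 0) / se_beta1                 # H0: beta1 = 0
t_crit = stats.t.ppf(1 - alpha / 2, df)             # ~2.048
p_value = 2 * stats.t.sf(abs(t_stat), df)           # two-sided p-value

print(round(t_stat, 3), round(t_crit, 3), p_value)  # ~8.649, 2.048, ~2e-09
```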
Confidence Interval Estimation for β₁

From the Excel output:

                        Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
  Intercept             -723.4509      120.7403         -5.9918   1.87E-06   -970.7762   -476.1256
  Monthly Income (RM)    0.4627          0.0535          8.6520   2.12E-09      0.3531      0.5722

Interpretation:
  We are 95% confident that for an additional RM1 in monthly income, monthly food expenditure will increase by between RM0.3531 and RM0.5722.

This 95% confidence interval for β₁ does not include 0.

Conclusion: There is a significant relationship between monthly food expenditure and monthly income at the 0.05 level of significance.
Confidence Interval Estimation for β₁

Confidence interval estimate of the slope:

  β̂₁ ± t_(α/2, n−k−1) · SE_β̂₁
  = 0.4627 ± (2.048)(0.0535)
  = (0.3531, 0.5722)

Interpretation:
  We are 95% confident that for an additional RM1 in monthly income, monthly food expenditure will increase by between RM0.3531 and RM0.5722.
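The same interval as a short Python sketch, letting scipy supply the critical value. Excel's (0.3531, 0.5722) is based on unrounded inputs, so the last digit can differ slightly.

```python
# 95% confidence interval for the slope from the rounded Excel values.
from scipy import stats

beta1_hat, se_beta1, df = 0.4627, 0.0535, 28
t_crit = stats.t.ppf(0.975, df)              # ~2.048
lower = beta1_hat - t_crit * se_beta1
upper = beta1_hat + t_crit * se_beta1
print(round(lower, 4), round(upper, 4))      # ~(0.3531, 0.5723)
```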
Summary

❑ Introduced types of regression models
❑ Reviewed the assumptions of regression and correlation
❑ Discussed determining the simple linear regression equation
❑ Described measures of variation
❑ Discussed residual analysis
Summary

❑ Introduced the goodness-of-fit measure of the model: R-square
❑ Made inference about the slope coefficient:
  ▪ Individual t-test
  ▪ Confidence interval estimate
❑ Made predictions of the value of the dependent variable using the simple linear regression model