
Regression Modelling

Week 2

Week 2 Regression Modelling 1 / 53


1 Estimated regression function (Ch 1.6)

2 Estimation of error terms variance σ² (Ch 1.7)

3 Normal error regression model (Ch 1.8)

4 Analysis of variance approach (ANOVA) (Ch 2.7)

5 An example in R



Section 1

Estimated regression function (Ch 1.6)



Estimated regression function

Ŷ = b0 + b1 X, where

b1 = Sxy/Sxx,  b0 = Ȳ − b1 X̄

The fitted value for the ith case: Ŷi

The observed value for the ith case: Yi


Residuals

The ith residual is the difference between the observed value Yi and the corresponding fitted value Ŷi:

ei = Yi − Ŷi

For our model, the residual becomes

ei = Yi − (b0 + b1 Xi)


Residuals

Do not confuse

εi = Yi − E(Yi)   "model error"
ei = Yi − Ŷi      "residual"

εi: deviation from the unknown true regression line

ei: deviation from the estimated regression line




Properties of fitted regression line

1 The sum of residuals is zero: Σ_{i=1}^n ei = 0.


Properties of fitted regression line

2. The sum of squared residuals, Σ_{i=1}^n ei², is a minimum.


Properties of fitted regression line

3. The sum of the observed values Yi equals the sum of the fitted values Ŷi:

Σ_{i=1}^n Yi = Σ_{i=1}^n Ŷi


Properties of fitted regression line

4. The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level of the predictor variable in the ith trial:

Σ_{i=1}^n Xi ei = 0


Properties of fitted regression line

5. The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the fitted value of the response variable for the ith trial:

Σ_{i=1}^n Ŷi ei = 0


Properties of fitted regression line

6. The regression line always goes through the point (X̄ , Ȳ )
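Properties 1 and 3-6 can be verified numerically; here is a minimal sketch in R using a small made-up dataset (the X and Y values below are illustrative, not from the textbook):

```r
# Check the fitted-line properties on a small illustrative dataset
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(Y ~ X)
e <- residuals(fit)
Yhat <- fitted(fit)

sum(e)              # property 1: zero (up to rounding error)
sum(Y) - sum(Yhat)  # property 3: zero
sum(X * e)          # property 4: zero
sum(Yhat * e)       # property 5: zero
# property 6: the fitted line passes through (X-bar, Y-bar)
coef(fit)[1] + coef(fit)[2] * mean(X) - mean(Y)  # zero
```

Property 2 needs no check here: minimizing Σei² is exactly what least squares (and hence lm) does by construction.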



Section 2

Estimation of error terms variance σ² (Ch 1.7)


Estimation of error terms variance σ² (Ch 1.7)

Recall: Estimate σ² for a single population

s² = Σ_{i=1}^n (Yi − Ȳ)² / (n − 1)


Estimation of error terms variance σ²

Estimate σ² for the regression model

Var(Yi) = Var(εi) = σ²

Use ei = Yi − Ŷi:

Var(ei) is estimated by Σ_{i=1}^n ei² / df


Estimation of error terms variance σ²

SSE: residual sum of squares

MSE: residual mean square

s² = MSE = SSE/(n − 2) = Σ_{i=1}^n (Yi − Ŷi)² / (n − 2)

s = √MSE

It can be shown that MSE is an unbiased estimator of σ²:

E(MSE) = σ²
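The unbiasedness of MSE can be illustrated with a quick simulation sketch (the true line and σ² = 4 below are arbitrary choices, not from the slides): averaging MSE over many simulated datasets should recover σ².

```r
# Average MSE over many simulated regressions; it should be close to
# the true error variance sigma^2 = 4 used to generate the data
set.seed(1)
sigma2 <- 4
X <- 1:10
mse <- replicate(2000, {
  Y <- 1 + 2 * X + rnorm(length(X), sd = sqrt(sigma2))
  sum(residuals(lm(Y ~ X))^2) / (length(X) - 2)
})
mean(mse)  # close to 4
```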



Section 3

Normal error regression model (Ch 1.8)



Normal error regression model (Ch 1.8)

Yi = β0 + β1 Xi + εi

No matter what the distribution of εi, the least squares method provides unbiased point estimators of β0 and β1 that have minimum variance among all unbiased linear estimators.

To set up interval estimates and make tests, we assume:

The errors εi ~ iid N(0, σ²)


Section 4

Analysis of variance approach (ANOVA) (Ch 2.7)



Analysis of variance approach (ANOVA) (Ch 2.7)

We consider the regression analysis from the perspective of analysis of variance

Useful in multiple regression


Partitioning of total sum of squares

The analysis of variance approach is based on the partitioning of sums of squares and degrees of freedom associated with the response variable Y.

Consider one single random variable Y


Partitioning of total sum of squares

Now consider a linear regression model, where Y is related to X


Partitioning of total sum of squares

Total sum of squares

SSTO = Σ_{i=1}^n (Yi − Ȳ)²

Regression sum of squares

SSR = Σ_{i=1}^n (Ŷi − Ȳ)²

Error (residual) sum of squares

SSE = Σ_{i=1}^n (Yi − Ŷi)²


Formal development of partitioning

We can see easily:

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)

1 The deviation of the fitted value Ŷi around the mean Ȳ

2 The deviation of the observation Yi around the fitted value Ŷi

The sums of squares have the same relationship:

Σ_{i=1}^n (Yi − Ȳ)² = Σ_{i=1}^n (Ŷi − Ȳ)² + Σ_{i=1}^n (Yi − Ŷi)²,

or

SSTO = SSR + SSE
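The identity SSTO = SSR + SSE is easy to confirm numerically; a minimal sketch in R with illustrative data (not the textbook's):

```r
# Verify the partitioning SSTO = SSR + SSE on a small made-up dataset
X <- c(2, 4, 6, 8, 10)
Y <- c(5, 9, 14, 18, 21)
fit <- lm(Y ~ X)
SSTO <- sum((Y - mean(Y))^2)
SSR  <- sum((fitted(fit) - mean(Y))^2)
SSE  <- sum(residuals(fit)^2)
SSTO - (SSR + SSE)  # zero, up to rounding error
```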




Breakdown of degrees of freedom

SSTO has n − 1 degrees of freedom

SSE has n − 2 degrees of freedom

SSR has 1 degree of freedom

n − 1 = (n − 2) + 1



Mean squares

A sum of squares divided by its associated degrees of freedom is called a mean square

Sample variance of Y is a mean square

Regression mean square

MSR = SSR/1 = SSR

Error (residual) mean square

MSE = SSE/(n − 2)

Mean squares are not additive
ANOVA table

Source of variation   SS                    df      MS
Regression            SSR = Σ(Ŷi − Ȳ)²      1       MSR = SSR/1
Error                 SSE = Σ(Yi − Ŷi)²     n − 2   MSE = SSE/(n − 2)
Total                 SSTO = Σ(Yi − Ȳ)²     n − 1


Expected mean squares

MSE and MSR are random variables; we have

E(MSE) = σ²
E(MSR) = σ² + β1² Sxx

When β1 = 0, the means of the sampling distributions of MSE and MSR are the same;

When β1 ≠ 0, the mean of the sampling distribution of MSR is larger than that of MSE.

Comparing MSR and MSE should be useful for testing whether β1 = 0.
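A simulation sketch makes the expected mean squares concrete (the design X = 1:20 and σ = 1 below are arbitrary illustrative choices): when β1 = 0 both mean squares average about σ², while for β1 ≠ 0 the average MSR exceeds it by roughly β1² Sxx.

```r
# Compare average MSR and MSE over many simulated datasets
set.seed(2)
X <- 1:20
Sxx <- sum((X - mean(X))^2)
mean_squares <- function(beta1, nsim = 2000) {
  out <- replicate(nsim, {
    Y <- beta1 * X + rnorm(length(X))   # true sigma^2 = 1
    a <- anova(lm(Y ~ X))
    c(MSR = a["X", "Mean Sq"], MSE = a["Residuals", "Mean Sq"])
  })
  rowMeans(out)
}
mean_squares(0)     # both averages near 1
mean_squares(0.1)   # MSR averages near 1 + 0.1^2 * Sxx; MSE still near 1
```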


F test

To test

H0: β1 = 0
Ha: β1 ≠ 0

we can use the test statistic:

F* = MSR/MSE

What's the distribution of F* under the null hypothesis?


F test

It can be proved that when β1 = 0:

SSE/σ² is distributed as χ²_{n−2};

SSR/σ² is distributed as χ²_1.

We also know that for two independent χ²-distributed random variables Z1 and Z2, with degrees of freedom df1 and df2, the ratio

(Z1/df1) / (Z2/df2)

will follow an F distribution with (df1, df2) degrees of freedom.
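This distributional claim can be checked by simulation; here is a sketch under the null (β1 = 0), with n = 25 as in the upcoming Toluca example (the design X = 1:n is an arbitrary choice):

```r
# Under H0: beta1 = 0, F* = MSR/MSE should follow F(1, n - 2), so the
# 5% critical value should be exceeded about 5% of the time
set.seed(3)
n <- 25
X <- 1:n
Fstar <- replicate(5000, {
  Y <- rnorm(n)                 # beta1 = 0: Y is unrelated to X
  anova(lm(Y ~ X))["X", "F value"]
})
mean(Fstar > qf(0.95, 1, n - 2))  # close to 0.05
```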


F test - decision rule

This is an upper-tail test. (Why?)

With a significance level α, we reject H0 when

F* > F(1 − α; 1, n − 2),

where F(1 − α; 1, n − 2) is the (1 − α)100th percentile of the F distribution.


Coefficient of determination (R²)

The coefficient of determination

R² = SSR/SSTO

measures the proportion of total variation in Y that can be explained by the fitted regression model

0 ≤ R² ≤ 1

In SLR, R² = r², where r is the coefficient of correlation.


Section 5

An example in R



Toluca Company example from the textbook

Table 1.1 page 19


Use dataset from the R package “ALSM”
Or download from Wattle “Kutner Textbook Datasets”, file named
“CH01TA01.txt”
# install.packages("ALSM")
library("ALSM")
mydata <- TolucaCompany
# mydata <- read.table("CH01TA01.txt")
# need to put the data file into your working directory first
X <- mydata[,1]
Y <- mydata[,2]

X = “Lot Size” and Y = “Hours Worked”



Scatter plot

#plot(mydata)
plot(X,Y, col="red", pch=17, xlab="Lot size", cex.lab=1.5,
     ylab = "Work hours", main = "Toluca Company")

[Figure: scatter plot of Work hours against Lot size, titled "Toluca Company"]
Summary statistics

summary(mydata)

## x y
## Min. : 20 Min. :113.0
## 1st Qu.: 50 1st Qu.:224.0
## Median : 70 Median :342.0
## Mean : 70 Mean :312.3
## 3rd Qu.: 90 3rd Qu.:389.0
## Max. :120 Max. :546.0



Summary statistics

boxplot(mydata)

[Figure: side-by-side boxplots of x and y]
Fit the SLM manually

Recall we have

b1 = Sxy/Sxx = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

b0 = Ȳ − b1 X̄

Xbar <- mean(X)
Ybar <- mean(Y)


Fit the SLM manually

Xcenter <- X - Xbar
Ycenter <- Y - Ybar
Sxy <- crossprod(Xcenter, Ycenter)
# can also use
# Sxy <- sum(Xcenter*Ycenter)
# Sxy <- t(Xcenter)%*%Ycenter
Xcenter

## [1] 10 -40 -20 20 0 -10 50 10 30 -20 -30 0 20 -5
## [18] -20 20 40 -40 20 -30 10 0

Sxy

## [,1]
## [1,] 70690

# You can calculate Sxx similarly


Fit the SLM manually

Sxx <- crossprod(Xcenter)
# Sxx <- sum(Xcenter^2)
Sxx

## [,1]
## [1,] 19800

b1 <- Sxy/Sxx
b0 <- Ybar - b1*Xbar
b0

## [,1]
## [1,] 62.36586

b1

## [,1]
## [1,] 3.570202
Fit the SLM manually

Another way to calculate b1:

b1 = rxy sy / sx,

where rxy is the sample correlation between X and Y.


Fit the SLM manually

b1 <- cor(X, Y)*sd(Y)/sd(X)
b1

## [1] 3.570202


Fitting with “lm” function

mymodel <- lm(Y ~ X)
# without intercept: lm(Y ~ X - 1)
# without slope: lm(Y ~ 1)
summary(mymodel)


Fitting with “lm” function
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.876 -34.088 -5.982 38.826 103.528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.366 26.177 2.382 0.0259 *
## X 3.570 0.347 10.290 4.45e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.82 on 23 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
## F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10
The estimated regression line

Ŷi = 62.366 + 3.570 Xi

plot(X,Y, pch = 16)
abline(mymodel, col="purple", lty=2, cex=1.5, lwd=2)

[Figure: scatter plot of Y against X with the fitted regression line overlaid]


Fitted Y values

Yhat <- b0 + b1*X
Yfit <- mymodel$fitted.values
round(Yhat)

## [1] 348 169 241 384 312 277 491 348 419 241 205 312 384 13
## [18] 241 384 455 169 384 205 348 312

round(Yfit)

## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 348 169 241 384 312 277 491 348 419 241 205 312 384 134 455
## 19 20 21 22 23 24 25
## 384 455 169 384 205 348 312


Residuals

Res <- Y - Yhat
Res <- mymodel$residuals   # equivalent to the line above
SSE <- sum(Res^2)
SSE

## [1] 54825.46

n = length(Y)
MSE <- SSE/(n-2)
MSE

## [1] 2383.716

# Estimate sigma
sigma_hat = sqrt(MSE)
sigma_hat

## [1] 48.82331

# This is also called "Residual standard error"
ANOVA - manually

# Total sum of squares
SSTO <- sum((Y - Ybar)^2)
SSTO

## [1] 307203

# Regression sum of squares
SSR <- sum((Yhat - Ybar)^2)
SSR

## [1] 252377.6

SSTO-SSR

## [1] 54825.46

# Regression mean square
MSR <- SSR/1


ANOVA - F test

Fstat <- MSR/MSE
Fstat

## [1] 105.8757

critical <- qf(0.95, 1, n-2)
critical

## [1] 4.279344

pvalue <- 1 - pf(Fstat, 1, n-2)
pvalue

## [1] 4.448828e-10


ANOVA - by R function

anova(mymodel)

## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 252378 252378 105.88 4.449e-10 ***
## Residuals 23 54825 2384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Coefficient of Determination

R² = SSR/SSTO

Rsqr <- SSR/SSTO
Rsqr

## [1] 0.8215335

Look at the summary output

Check with coefficient of correlation

cor(X, Y)

## [1] 0.9063848

cor(X, Y)^2

## [1] 0.8215335
