Regression Modelling
Week 2
1. Estimated regression function (Ch 1.6)
2. Estimation of error terms variance σ² (Ch 1.7)
3. Normal error regression model (Ch 1.8)
4. Analysis of variance approach (ANOVA) (Ch 2.7)
5. An example in R
Section 1
Estimated regression function (Ch 1.6)
Estimated regression function
Ŷ = b0 + b1 X, where

    b1 = Sxy / Sxx,    b0 = Ȳ − b1 X̄
The fitted value for the ith case: Ŷi
The observed value for the ith case: Yi
Residuals
The ith residual is the difference between the observed value Yi and
the corresponding fitted value Ŷi .
ei = Yi − Ŷi
For our model, the residuals become
ei = Yi − (b0 + b1 Xi)
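As a quick illustration, here is a minimal R sketch on simulated data (every value below is made up for illustration):

# Simulate a small data set (illustrative values only)
set.seed(1)
X <- runif(30, 20, 120)
Y <- 60 + 3.5 * X + rnorm(30, sd = 50)

# Least squares estimates: b1 = Sxy/Sxx, b0 = Ybar - b1*Xbar
b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
b0 <- mean(Y) - b1 * mean(X)

# Fitted values and residuals
Yhat <- b0 + b1 * X
e <- Y - Yhat
head(round(e, 2))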
Residuals
Do not confuse
εi = Yi − E(Yi)   "Model error"
ei = Yi − Ŷi      "Residual"

εi: deviation from the unknown true regression line
ei: deviation from the estimated regression line
Residuals

[Figure: illustration of the residuals around the fitted regression line]
Properties of fitted regression line
1. The sum of residuals is zero: ∑ ei = 0 (all sums here run over i = 1, …, n).
Properties of fitted regression line
2. The sum of squared residuals, ∑ ei², is a minimum.
Properties of fitted regression line
3. The sum of the observed values Yi equals the sum of the fitted values Ŷi:

    ∑ Yi = ∑ Ŷi
Properties of fitted regression line
4. The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level of the predictor variable in the ith trial:

    ∑ Xi ei = 0
Properties of fitted regression line
5. The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the fitted value of the response variable for the ith trial:

    ∑ Ŷi ei = 0
Properties of fitted regression line
6. The regression line always goes through the point (X̄, Ȳ).
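All of these properties can be checked numerically. A minimal sketch, reusing the simulated X and Y from the earlier snippet (properties 1 and 3–5 reduce to sums that should vanish up to floating-point error; property 2 is the least squares criterion itself):

fit <- lm(Y ~ X)      # least squares fit
e <- resid(fit)       # residuals
Yhat <- fitted(fit)   # fitted values

# Properties 1, 3, 4, 5: each of these should be numerically zero
c(sum(e), sum(Y) - sum(Yhat), sum(X * e), sum(Yhat * e))

# Property 6: the fitted line passes through (Xbar, Ybar)
predict(fit, newdata = data.frame(X = mean(X))) - mean(Y)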
Section 2
Estimation of error terms variance σ² (Ch 1.7)
Estimation of error terms variance σ² (Ch 1.7)
Recall: the estimate of σ² for a single population is

    s² = ∑ (Yi − Ȳ)² / (n − 1)
Estimation of error terms variance σ²
Estimate σ² for the regression model, where

    Var(Yi) = Var(εi) = σ²

Use the residuals ei = Yi − Ŷi:

    Var(ei) = ∑ ei² / DF
Estimation of error terms variance σ²
SSE: residual sum of squares
MSE: residual mean square

    s² = MSE = SSE / (n − 2) = ∑ (Yi − Ŷi)² / (n − 2)

    s = √MSE

It can be shown that MSE is an unbiased estimator of σ²:

    E(MSE) = σ²
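A simulation sketch of this unbiasedness (the model parameters below are made up): the average of MSE over many simulated samples should be close to the true σ².

set.seed(42)
n <- 25
sigma2 <- 2500   # true error variance (illustrative)
X <- seq(20, 120, length.out = n)

mse <- replicate(5000, {
  Y <- 60 + 3.5 * X + rnorm(n, sd = sqrt(sigma2))
  sum(resid(lm(Y ~ X))^2) / (n - 2)   # MSE = SSE / (n - 2)
})
mean(mse)   # should be close to sigma2 = 2500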
Section 3
Normal error regression model (Ch 1.8)
Normal error regression model (Ch 1.8)
Yi = β0 + β1 Xi + εi

Whatever the form of the distribution of εi, the least squares method provides unbiased point estimators of β0 and β1 that have minimum variance among all unbiased linear estimators.

To set up interval estimates and make tests, we assume the errors are εi ~ iid N(0, σ²).
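A minimal sketch of simulating data that satisfy this model and informally checking the normality assumption (all parameter values are made up):

set.seed(7)
n <- 50
X <- runif(n, 0, 10)
Y <- 2 + 0.5 * X + rnorm(n, sd = 1)   # errors iid N(0, 1)

fit <- lm(Y ~ X)
# Under the normal error model the residuals should look roughly normal
qqnorm(resid(fit)); qqline(resid(fit))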
Section 4
Analysis of variance approach (ANOVA) (Ch 2.7)
Analysis of variance approach (ANOVA) (Ch 2.7)
We consider regression analysis from the perspective of analysis of variance.
Useful in multiple regression
Partitioning of total sum of squares
The analysis of variance approach is based on the partitioning of sums of
squares and degrees of freedom associated with the response variable Y .
Consider first a single random variable Y.
Partitioning of total sum of squares
Now consider a linear regression model, where Y is related to X.
Partitioning of total sum of squares
Total sum of squares:

    SSTO = ∑ (Yi − Ȳ)²

Regression sum of squares:

    SSR = ∑ (Ŷi − Ȳ)²

Error (residual) sum of squares:

    SSE = ∑ (Yi − Ŷi)²
Formal development of partitioning
We can decompose each total deviation:

    Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)

1. The deviation of the fitted value Ŷi around the mean Ȳ
2. The deviation of the observation Yi around the fitted value Ŷi

The sums of squares satisfy the same relationship, because the cross-product term vanishes by the residual properties above:

    ∑ (Yi − Ȳ)² = ∑ (Ŷi − Ȳ)² + ∑ (Yi − Ŷi)²,

or

    SSTO = SSR + SSE
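A quick numerical check of this partition (reusing any simulated X and Y from the sketches above):

fit <- lm(Y ~ X)
SSTO <- sum((Y - mean(Y))^2)
SSR  <- sum((fitted(fit) - mean(Y))^2)
SSE  <- sum(resid(fit)^2)
c(SSTO, SSR + SSE)   # the two totals agree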
Formal development of partitioning

[Figure: the total deviation Yi − Ȳ split into Ŷi − Ȳ and Yi − Ŷi on a scatter plot]
Breakdown of degrees of freedom
SSTO has n − 1 degrees of freedom
SSE has n − 2 degrees of freedom
SSR has 1 degree of freedom
n − 1 = (n − 2) + 1
Mean squares
A sum of squares divided by its associated degrees of freedom is called a mean square.

The sample variance of Y is a mean square: s² = SSTO / (n − 1).

Regression mean square:

    MSR = SSR / 1 = SSR

Error (residual) mean square:

    MSE = SSE / (n − 2)

Note: mean squares are not additive.
ANOVA table
Source of variation   SS                    df      MS
Regression            SSR = ∑ (Ŷi − Ȳ)²     1       MSR = SSR / 1
Error                 SSE = ∑ (Yi − Ŷi)²    n − 2   MSE = SSE / (n − 2)
Total                 SSTO = ∑ (Yi − Ȳ)²    n − 1
Expected mean squares
MSE and MSR are random variables, and

    E(MSE) = σ²
    E(MSR) = σ² + β1² Sxx

When β1 = 0, the means of the sampling distributions of MSE and MSR are the same;
when β1 ≠ 0, the mean of the sampling distribution of MSR is larger than that of MSE.
Comparing MSR with MSE should therefore be useful for testing whether β1 = 0.
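A simulation sketch of these two expectations (β1, σ², and the design points are all made up):

set.seed(123)
n <- 25; sigma2 <- 2500; beta1 <- 3.5
X <- seq(20, 120, length.out = n)
Sxx <- sum((X - mean(X))^2)

ms <- replicate(5000, {
  Y <- 60 + beta1 * X + rnorm(n, sd = sqrt(sigma2))
  fit <- lm(Y ~ X)
  c(MSE = sum(resid(fit)^2) / (n - 2),
    MSR = sum((fitted(fit) - mean(Y))^2) / 1)
})
rowMeans(ms)                        # simulated E(MSE), E(MSR)
c(sigma2, sigma2 + beta1^2 * Sxx)   # theoretical values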
F test
To test

    H0: β1 = 0  vs.  Ha: β1 ≠ 0,

we can use the test statistic

    F* = MSR / MSE

What's the distribution of F* under the null hypothesis?
F test
It can be proved that when β1 = 0:

    SSE/σ² is distributed as χ² with n − 2 degrees of freedom;
    SSR/σ² is distributed as χ² with 1 degree of freedom.

We also know that for two independent χ²-distributed random variables Z1 and Z2, with degrees of freedom df1 and df2, the ratio

    (Z1/df1) / (Z2/df2)

follows an F distribution with (df1, df2) degrees of freedom.
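A simulation sketch of this result (simulate under H0: β1 = 0 with made-up settings; the histogram of the simulated F* statistics should match the F(1, n − 2) density):

set.seed(99)
n <- 25
X <- seq(20, 120, length.out = n)

Fstar <- replicate(5000, {
  Y <- 60 + rnorm(n, sd = 50)     # beta1 = 0, so H0 holds
  anova(lm(Y ~ X))[1, "F value"]  # F* = MSR / MSE
})
hist(Fstar, breaks = 50, freq = FALSE, main = "F* under H0")
curve(df(x, 1, n - 2), add = TRUE, col = "red")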
F test - decision rule
This is an upper-tail test. (why?)
With significance level α, we reject H0 when

    F* > F(1 − α; 1, n − 2),

where F(1 − α; 1, n − 2) is the 100(1 − α)th percentile of the F distribution with (1, n − 2) degrees of freedom.
Coefficient of determination (R²)
The coefficient of determination

    R² = SSR / SSTO

measures the proportion of the total variation in Y that is explained by the fitted regression model.

    0 ≤ R² ≤ 1

In SLR, R² = r², where r is the coefficient of correlation.
Section 5
An example in R
Toluca Company example from the textbook
Table 1.1 page 19
Use dataset from the R package “ALSM”
Or download from Wattle “Kutner Textbook Datasets”, file named
“CH01TA01.txt”
# install.packages("ALSM")
library("ALSM")
mydata <- TolucaCompany
# mydata <- read.table("CH01TA01.txt")
# need to put the data file into your working directory first
X <- mydata[,1]
Y <- mydata[,2]
X = “Lot Size” and Y = “Hours Worked”
Scatter plot
#plot(mydata)
plot(X,Y, col="red", pch=17, xlab="Lot size", cex.lab=1.5,
ylab = "Work hours", main = "Toluca Company")
[Figure: "Toluca Company" scatter plot of work hours against lot size]
Summary statistics
summary(mydata)
## x y
## Min. : 20 Min. :113.0
## 1st Qu.: 50 1st Qu.:224.0
## Median : 70 Median :342.0
## Mean : 70 Mean :312.3
## 3rd Qu.: 90 3rd Qu.:389.0
## Max. :120 Max. :546.0
Summary statistics
boxplot(mydata)
[Figure: side-by-side boxplots of x (lot size) and y (work hours)]
Fit the SLM manually
Recall we have

    b1 = Sxy / Sxx = ∑ (Xi − X̄)(Yi − Ȳ) / ∑ (Xi − X̄)²

    b0 = Ȳ − b1 X̄
Xbar <- mean(X)
Ybar <- mean(Y)
Fit the SLM manually
Xcenter <- X - Xbar
Ycenter <- Y - Ybar
Sxy <- crossprod(Xcenter, Ycenter)
# can also use
# Sxy <- sum(Xcenter*Ycenter)
# Sxy <- t(Xcenter)%*%Ycenter
Xcenter
## [1] 10 -40 -20 20 0 -10 50 10 30 -20 -30 0 20 -50 40 30 -40
## [18] -20 20 40 -40 20 -30 10 0
Sxy
## [,1]
## [1,] 70690
# You can calculate Sxx similarly
Fit the SLM manually
Sxx <- crossprod(Xcenter)
# Sxx <- sum(Xcenter^2)
Sxx
## [,1]
## [1,] 19800
b1 <- Sxy/Sxx
b0 <- Ybar - b1*Xbar
b0
## [,1]
## [1,] 62.36586
b1
## [,1]
## [1,] 3.570202
Fit the SLM manually
Another way to calculate b1:

    b1 = rxy sy / sx,

where rxy is the sample correlation between X and Y, and sx, sy are the sample standard deviations of X and Y.
Fit the SLM manually
b1 <- cor(X, Y)*sd(Y)/sd(X)
b1
## [1] 3.570202
Fitting with “lm” function
mymodel <- lm(Y ~ X)
# without intercept: lm(Y ~ X -1)
# without slope: lm(Y ~ 1)
summary(mymodel)
Fitting with “lm” function
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.876 -34.088 -5.982 38.826 103.528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.366 26.177 2.382 0.0259 *
## X 3.570 0.347 10.290 4.45e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.82 on 23 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
## F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10
The estimated regression line
Ŷi = 62.366 + 3.570Xi
plot(X,Y, pch = 16)
abline(mymodel, col="purple", lty=2, cex=1.5, lwd=2)
[Figure: scatter plot of Y against X with the fitted regression line overlaid]
Fitted Y values
Yhat <- b0 + b1*X
Yfit <- mymodel$fitted.values
round(Yhat)
## [1] 348 169 241 384 312 277 491 348 419 241 205 312 384 134 455 419 169
## [18] 241 384 455 169 384 205 348 312
round(Yfit)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 348 169 241 384 312 277 491 348 419 241 205 312 384 134 455 419 169 241
## 19 20 21 22 23 24 25
## 384 455 169 384 205 348 312
Residuals
Res <- Y - Yhat
# or equivalently, from the lm object:
Res <- mymodel$residuals
SSE <- sum(Res^2)
SSE
## [1] 54825.46
n = length(Y)
MSE <- SSE/(n-2)
MSE
## [1] 2383.716
# Estimate sigma
sigma_hat = sqrt(MSE)
sigma_hat
## [1] 48.82331
# This is also called "Residual standard error"
ANOVA - manually
# Total sum of squares
SSTO <- sum((Y - Ybar)^2)
SSTO
## [1] 307203
# Regression sum of squares
SSR <- sum((Yhat - Ybar)^2)
SSR
## [1] 252377.6
SSTO - SSR   # equals SSE
## [1] 54825.46
# Regression mean square
MSR <- SSR/1
ANOVA - F test
Fstat <- MSR/MSE
Fstat
## [1] 105.8757
critical <- qf(0.95, 1, n-2)
critical
## [1] 4.279344
pvalue <- 1 - pf(Fstat, 1, n-2)
pvalue
## [1] 4.448828e-10
ANOVA - by R function
anova(mymodel)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 252378 252378 105.88 4.449e-10 ***
## Residuals 23 54825 2384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Coefficient of Determination
R² = SSR / SSTO
Rsqr <- SSR/SSTO
Rsqr
## [1] 0.8215335
Compare with "Multiple R-squared" in the summary output above.
Check against the coefficient of correlation:
cor(X, Y)
## [1] 0.9063848
cor(X, Y)^2
## [1] 0.8215335