Modelling cycle
This leads us into a modelling cycle:
1. Fit the model
2. Examine the residuals
3. Transform the data or change the model if necessary
This cycle is repeated until we are “happy” with the fitted model.
Diagrammatically:
H.M.F 15
Modelling cycle (flow diagram)
Choose model (plots, theory) -> Fit model -> Examine residuals
  Bad fit: transform the data or change the model, then fit again
  Good fit: use model
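One pass through this cycle can be sketched in R as follows. This is a minimal illustration, not a prescribed procedure. It assumes the Anscombe data from the carData package (installed alongside car). The cut-off of 2 for standardized residuals and the names `fit`, `std_res` and `flagged` are choices made for this sketch.

```r
# One pass through the modelling cycle (illustrative sketch).
# Assumption: the carData package, which ships the Anscombe data, is installed.
data(Anscombe, package = "carData")

# 1. Fit the model.
fit <- lm(education ~ income + young + urban, data = Anscombe)

# 2. Examine the residuals: plot them against the fitted values and
#    flag cases with large standardized residuals.
std_res <- rstandard(fit)
plot(fitted(fit), std_res,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = c(-2, 0, 2), lty = 2)
flagged <- which(abs(std_res) > 2)  # cut-off of 2 is a rule of thumb (assumption)

# 3. Decide: if nothing looks wrong, use the model; otherwise transform
#    the data or change the model, and go back to step 1.
if (length(flagged) == 0) {
  message("No large residuals: the fit looks acceptable.")
} else {
  message("Check these cases before using the model: ",
          paste(names(flagged), collapse = ", "))
}
```

In practice step 2 usually involves several plots (for example `plot(fit)` gives the standard set of diagnostic plots for an `lm` object), and the decision in step 3 is a judgment call rather than an automatic rule.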
Example: U.S. State Public-School Expenditures
Data (Anscombe) for the 50 states of the USA plus Washington, D.C. (51 observations):
library(car); data(Anscombe)
Variables are:
Per capita expenditure on education (response), variable education
Per capita income, variable income
Number of residents per 1000 under 18, variable young
Number of residents per 1000 in urban areas, variable urban
Fit model: education ~ income + young + urban
[Figure: scatterplot matrix of educ (response), percap, under18 and urban, with one point marked "Outlier!"]
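A scatterplot matrix like this one can be drawn with base R's `pairs()`. The sketch below assumes the Anscombe data from the carData package. The panel labels in the figure (educ, percap, under18) appear to correspond to the variables education, income and young in that data frame, which is an inference from the variable list rather than something the slides state.

```r
# Load the data (assumption: the carData package is installed).
data(Anscombe, package = "carData")

# Quick look at the structure and the first few rows.
str(Anscombe)
head(Anscombe)

# Scatterplot matrix of all four variables; isolated points in any
# panel are candidates for closer inspection.
pairs(Anscombe, main = "Anscombe data: scatterplot matrix")
```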
Basic fit, outlier in
> educ.lm = lm(education ~ income + young + urban, data = Anscombe)
> summary(educ.lm)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.868e+02  6.492e+01  -4.418 5.82e-05 ***
income       8.065e-02  9.299e-03   8.674 2.56e-11 ***
young        8.173e-01  1.598e-01   5.115 5.69e-06 ***
urban       -1.058e-01  3.428e-02  -3.086  0.00339 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 26.69 on 47 degrees of freedom
Multiple R-squared: 0.6896, Adjusted R-squared: 0.6698
F-statistic: 34.81 on 3 and 47 DF, p-value: 5.337e-12
R² is 69%.
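The quantities printed by `summary()` can also be pulled out of the summary object directly. The sketch below assumes the Anscombe data from the carData package. The names `s`, `r2`, `rss`, `tss` and `r2_by_hand` are introduced here for illustration. It also recomputes R² from its definition, 1 - RSS/TSS, as a check.

```r
# Assumption: the carData package, which ships the Anscombe data, is installed.
data(Anscombe, package = "carData")
educ.lm <- lm(education ~ income + young + urban, data = Anscombe)

# Components of the summary object.
s <- summary(educ.lm)
r2        <- s$r.squared        # multiple R-squared
adj_r2    <- s$adj.r.squared    # adjusted R-squared
sigma_hat <- s$sigma            # residual standard error
df_resid  <- educ.lm$df.residual

# R-squared from its definition: 1 - RSS / TSS.
rss <- sum(residuals(educ.lm)^2)
tss <- sum((Anscombe$education - mean(Anscombe$education))^2)
r2_by_hand <- 1 - rss / tss

cat(sprintf("R-squared: %.1f%% (by hand: %.1f%%)\n",
            100 * r2, 100 * r2_by_hand))
```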
Basic fit, outlier out
> educ50.lm = lm(education ~ income + young + urban, data = Anscombe, subset = -50)  # subset = -50 excludes point 50
> summary(educ50.lm)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -242.18729   82.19788  -2.946 0.005033 **
income         0.07432    0.01173   6.336 9.07e-08 ***
young          0.71232    0.19901   3.579 0.000826 ***
urban         -0.08657    0.04060  -2.132 0.038369 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 26.75 on 46 degrees of freedom
Multiple R-squared: 0.5692, Adjusted R-squared: 0.5411
F-statistic: 20.26 on 3 and 46 DF, p-value: 1.636e-08
R² is now 57%, down from 69% with the outlier included.
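To see the effect of dropping the point, the two fits can be put side by side. This is a sketch assuming the Anscombe data from the carData package. The names `comparison` and `r2` are introduced here for illustration.

```r
# Assumption: the carData package, which ships the Anscombe data, is installed.
data(Anscombe, package = "carData")

# Fit with all cases, and with case 50 excluded.
educ.lm   <- lm(education ~ income + young + urban, data = Anscombe)
educ50.lm <- lm(education ~ income + young + urban, data = Anscombe,
                subset = -50)

# Which case does subset = -50 drop?
rownames(Anscombe)[50]

# Coefficients and R-squared for the two fits, side by side.
comparison <- cbind(all_cases  = coef(educ.lm),
                    without_50 = coef(educ50.lm))
r2 <- c(all_cases  = summary(educ.lm)$r.squared,
        without_50 = summary(educ50.lm)$r.squared)

print(round(comparison, 4))
print(round(r2, 4))
```

Because the two fits use different sets of cases, their total sums of squares differ, so the drop in R² does not by itself mean the second model fits worse. The residual standard errors of the two fits (26.69 and 26.75) are almost identical.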