Regression diagnostics {reg-diag}
Diagnostic plots
Regression diagnostic plots can be created with the base R
function plot() or with the autoplot() function from the
ggfortify package, which produces ggplot2-based
graphics.
Create the diagnostic plots with the R base function:
par(mfrow = c(2, 2))
plot(model)
For the second option, use autoplot() from the ggfortify package:
library(ggfortify)
autoplot(model)
H.M.F 10
The diagnostic plots show the residuals in four different ways:
Residuals vs Fitted. Used to check the linearity assumption. A roughly
horizontal line with no distinct pattern indicates a linear
relationship, which is good.
Normal Q-Q. Used to examine whether the residuals are normally distributed.
It is good if the residual points follow the straight dashed line.
Scale-Location (or Spread-Location). Used to check the homogeneity of
variance of the residuals (homoscedasticity). A horizontal line with equally
spread points is a good indication of homoscedasticity.
Residuals vs Leverage. Used to identify influential cases, that is, extreme
values that might influence the regression results when included in or excluded
from the analysis. This plot is described further in the next sections.
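The quantities plotted in the four panels can also be computed directly, which helps when reading the plots. The sketch below assumes a stand-in model fitted to the built-in mtcars data (the slides do not show how `model` was defined) and uses only base R extractor functions.

```r
# Assumption: mtcars and this formula are stand-ins for the data behind `model`.
model <- lm(mpg ~ wt + hp, data = mtcars)

diag_df <- data.frame(
  fitted       = fitted(model),               # x-axis: Residuals vs Fitted, Scale-Location
  resid        = residuals(model),            # y-axis: Residuals vs Fitted
  std_resid    = rstandard(model),            # y-axis: Normal Q-Q, Residuals vs Leverage
  sqrt_abs_std = sqrt(abs(rstandard(model))), # y-axis: Scale-Location
  leverage     = hatvalues(model)             # x-axis: Residuals vs Leverage
)
head(diag_df)

# A single panel can be drawn with the `which` argument of plot() for lm objects
plot(model, which = 1)  # Residuals vs Fitted only
```

In plot() for lm objects, `which = 1`, `2`, `3` and `5` select the four default panels in the order listed above.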
Outliers and high leverage points
Outliers:
An outlier is a point with an extreme value of the outcome variable. Outliers
may affect the interpretation of the model because they increase the residual
standard error.
Outliers can be identified by examining the standardized residual (or studentized
residual), which is the residual divided by its estimated standard error.
Observations whose standardized residuals are greater than 3 in absolute value are
possible outliers (James et al. 2014).
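The rule of thumb above can be applied with rstandard(), as in this minimal sketch. The mtcars model is an assumed stand-in for `model`, and the cutoff of 3 is the one quoted from James et al.

```r
# Assumption: mtcars and this formula are stand-ins for the data behind `model`.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals: each residual divided by its estimated standard error
std_res <- rstandard(model)

# Indices (with row names) of observations beyond the |3| rule of thumb;
# the result is empty if nothing is flagged
outliers <- which(abs(std_res) > 3)
outliers

# rstudent(model) gives the (externally) studentized version, if preferred
```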
High leverage points:
A data point has high leverage if it has extreme predictor (x) values. This can be
detected by examining the leverage statistic, or hat-value. A value of this statistic
above 2(p + 1)/n indicates an observation with high leverage (P. Bruce and Bruce
2017), where p is the number of predictors and n is the number of observations.
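As a sketch of the 2(p + 1)/n threshold in code, hat-values can be extracted with hatvalues(). The model below is an assumed stand-in fitted to mtcars, with p = 2 predictors and n = 32 observations.

```r
# Assumption: mtcars and this formula are stand-ins for the data behind `model`.
model <- lm(mpg ~ wt + hp, data = mtcars)

lev <- hatvalues(model)            # leverage statistic (hat-value) per observation
p   <- length(coef(model)) - 1     # number of predictors, excluding the intercept
n   <- nobs(model)                 # number of observations

threshold <- 2 * (p + 1) / n       # here 2 * 3 / 32 = 0.1875

# Observations whose leverage exceeds the threshold
high_leverage <- lev[lev > threshold]
high_leverage
```

The hat-values always sum to p + 1, so the threshold is twice the average leverage.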
Outliers and high leverage points can be identified by inspecting the Residuals vs
Leverage plot:
plot(model, 5)
Influential values
An influential value is a value whose inclusion or exclusion can alter the
results of the regression analysis. Such a value is associated with a large
residual.
Not all outliers (or extreme data points) are influential in linear regression
analysis.
Statisticians have developed a metric called Cook’s distance to measure
the influence of a value. A rule of thumb is that an observation has high
influence if its Cook’s distance exceeds 4/(n - p - 1) (P. Bruce and Bruce 2017),
where n is the number of observations and p is the number of predictor
variables.
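The Cook’s distance rule of thumb can be checked numerically, as sketched below. The mtcars model is an assumed stand-in for `model`. For a model with an intercept, n - p - 1 equals the residual degrees of freedom.

```r
# Assumption: mtcars and this formula are stand-ins for the data behind `model`.
model <- lm(mpg ~ wt + hp, data = mtcars)

cd <- cooks.distance(model)        # Cook's distance per observation
n  <- nobs(model)                  # number of observations
p  <- length(coef(model)) - 1      # number of predictors, excluding the intercept

cutoff <- 4 / (n - p - 1)          # same as 4 / df.residual(model)

# Observations above the cutoff, largest first (empty if none are flagged)
influential <- sort(cd[cd > cutoff], decreasing = TRUE)
influential
```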
The Residuals vs Leverage plot can help us find influential observations,
if any.
On this plot, outlying values are generally located in the upper right
or lower right corner. Points in those regions can be influential on the
regression line.
The following plots illustrate the Cook’s distance and the leverage of our
model:
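The code that produced these plots is not shown here. One way to draw them with base R is sketched below, assuming a stand-in model fitted to mtcars. The `which` and `id.n` arguments belong to plot() for lm objects.

```r
# Assumption: mtcars and this formula are stand-ins for the data behind `model`.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance per observation; id.n labels the 5 most extreme points
plot(model, which = 4, id.n = 5)

# Residuals vs Leverage, with Cook's distance contours
plot(model, which = 5)
```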
summary(influence.measures(model))
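summary() prints only the observations flagged as potentially influential on at least one measure. To work with those rows programmatically, the object returned by influence.measures() can be inspected directly. This is a sketch, again assuming a stand-in mtcars model.

```r
# Assumption: mtcars and this formula are stand-ins for the data behind `model`.
model <- lm(mpg ~ wt + hp, data = mtcars)

infl <- influence.measures(model)

# infl$infmat: numeric matrix of measures (DFBETAS, DFFITS, cov.r, cook.d, hat)
# infl$is.inf: logical matrix, TRUE where a measure flags the observation
flagged <- which(apply(infl$is.inf, 1, any))

# Influence measures for the flagged observations only
infl$infmat[flagged, , drop = FALSE]
```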
Modelling cycle
This leads us into a modelling cycle:
Fit
Examine residuals
Transform data or change model if necessary
This cycle is repeated until we are “happy” with the
fitted model
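One pass through this cycle might look like the sketch below. The data set, the formulas and the choice of a log transform are all illustrative assumptions. Whether a transformation helps has to be judged from the residual plots of the model at hand.

```r
# Assumption: mtcars, these formulas and the log transform are illustrative only.

# 1. Fit
fit1 <- lm(mpg ~ hp, data = mtcars)

# 2. Examine residuals
par(mfrow = c(2, 2))
plot(fit1)

# 3. Transform data or change model if necessary, then refit
fit2 <- lm(log(mpg) ~ log(hp), data = mtcars)

# Back to step 2 with the new model; repeat until the diagnostics look acceptable
plot(fit2)
par(mfrow = c(1, 1))
```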
Diagrammatically…