Chapter 1 Introduction
STAT 3008 Applied Regression Analysis
Department of Statistics
The Chinese University of Hong Kong
2021/22 Term 1
Dr. LEE Pak Kuen, Philip
Chapter Outline
• Section 1.1: Motivation
• Section 1.2: Five Examples on Regression
• Section 1.3: Installation of R and R libraries
• Section 1.4: Mean and Variance Functions
• Section 1.5: Separated Points
• Section 1.6: Scatterplot
Section 1.1
Motivation
Motivation: Example
• Problem of Interest: Predict the overall GPA of students at CUHK
• Methodology:
1. Select a random sample of students who graduated from CUHK
2. Record the following for each student:
• Overall GPA
• Characteristics of the student, e.g. IQ, AL results, major, gender, etc.
3. Use the above information to predict the overall GPA of current students
Motivation: Example
• Points not exactly on a straight line, why?
• How to use a mathematical model to relate Y (GPA) and X (IQ)?
Linear Regression in a Page
Steps:
1. Select a random sample of students who graduated from CUHK
2. Record y (GPA) and x (IQ)
3. Plot (x, y) on a scatterplot
4. Find the straight-line equation that fits the data points best, e.g. GPA = 2.00 + 0.01(IQ)
5. Predict the GPA using a new student’s IQ
Regression studies the dependency between
Explanatory Variables (X) and the Response Variable (Y)
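The five steps can be sketched in R as follows. This is only an illustration: the IQ and GPA values are simulated (not real CUHK records), and the object names `IQ`, `GPA`, `fit`, and `new_student` are made up for the example.

```r
# Hypothetical data: IQ and GPA values are simulated, not real CUHK records.
set.seed(3008)
n   <- 50
IQ  <- round(rnorm(n, mean = 110, sd = 12))              # Steps 1-2: sample x
GPA <- 2.00 + 0.01 * IQ + rnorm(n, mean = 0, sd = 0.2)   # Steps 1-2: sample y

fit <- lm(GPA ~ IQ)      # Step 4: least-squares straight line
coef(fit)                # estimated intercept a and slope b

# Step 5: predict the GPA of a new student with IQ = 120
new_student <- data.frame(IQ = 120)
pred <- predict(fit, newdata = new_student)
pred
```

For Step 3, `plot(IQ, GPA)` followed by `abline(fit)` draws the scatterplot with the fitted line on top.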
Linear Regression Y = a +bX
Regression studies the dependency between the
Explanatory Variables (X) and Response Variable (Y)
• Explanatory Variable (EV) X: Also known as predictor, or
independent variable
• Response Variable (RV) Y: Also known as dependent variable
Linear Regression – Typical Problems of Interest
• Obtain the best estimates of the regression line (i.e. the
intercept a and the slope b).
• Predict the value of the RV, based on a new set of EVs.
• Identify the EVs which are important to explain the RV.
• Is the regression line good enough to explain the data? If not,
how can we extend the regression line to a more complicated
model?
Section 1.2
Five Examples on Regression
Examples on Regression
• Next few pages: Examples with data available in R (alr4 library)
• Messages from the examples:
• We will go through some of these examples in detail in later
chapters
Example 1 – Inheritance of Height
Problem of interest: Want to study how the Daughter’s height is
affected by the Mother’s height
Data: n = 1,375 families
x = Heights (in inches) of
mothers in the UK under age
65 (Mheight)
y = Heights (in inches) of one
of their adult daughters over
age 18 (Dheight)
Question: Can we interchange x and y?
Example 1 – Inheritance of Height
x = Heights (in inches) of mothers in the UK under age 65 (Mheight)
y = Heights (in inches) of one of their adult daughters over age 18 (Dheight)
Findings from the Scatterplot:
• Dheight increases with Mheight
• The two variables are of similar range
(55-70 inches)
• The points appear to form an
elliptical region*
=> Linear regression would make
sense
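* (STAT2001): Joint pdf of Bivariate Normal is elliptical in shape
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix} \right), \qquad Y \mid X = x \sim N\left( \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x),\ (1 - \rho^2)\sigma_y^2 \right)$$
Given Mheight = x (inches), Dheight is normally distributed with constant
variance $(1 - \rho^2)\sigma_y^2$.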
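The bivariate normal result can be checked by simulation. In the sketch below, Y is drawn with conditional mean μy + ρ(σy/σx)(x − μx) and conditional variance (1 − ρ²)σy², and a fitted regression line is compared with these values. All parameter values are assumptions chosen for illustration, not estimates from the Heights data.

```r
# Simulated heights (in inches); all parameter values below are assumptions.
set.seed(1)
n    <- 10000
mu_x <- 62.5; mu_y <- 63.8     # means of mother's and daughter's heights
sd_x <- 2.4;  sd_y <- 2.6      # standard deviations
rho  <- 0.5                    # correlation between the two heights

x <- rnorm(n, mean = mu_x, sd = sd_x)
# Draw Y from its conditional distribution given X = x
cond_mean <- mu_y + rho * (sd_y / sd_x) * (x - mu_x)
cond_sd   <- sqrt(1 - rho^2) * sd_y
y <- rnorm(n, mean = cond_mean, sd = cond_sd)

fit <- lm(y ~ x)
slope_theory <- rho * sd_y / sd_x
c(fitted = unname(coef(fit)[2]), theory = slope_theory)   # slope of the line
c(residual_sd = sigma(fit), theory = cond_sd)             # conditional sd
```

The fitted slope and residual standard deviation should land close to the theoretical values, which is why a linear mean function with constant variance is a natural model for an elliptical point cloud.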
Example 2 – Forbes’ Data
• The barometer (氣壓計) was a fragile instrument for measuring
atmospheric pressure in the 1850s.
• James D. Forbes (1857): Use the boiling point of water, which
can be measured more reliably with a thermometer, as a
substitute for measuring atmospheric pressure
• At 17 different locations in the Alps and Scotland, he
measured
• the pressure (in inches of mercury) using a barometer,
and
• the boiling point of water (in °F)
• Question: Does the boiling point of water vary with
atmospheric pressure in a linear way?
Example 2 – Forbes’ Data
• High Altitude: Low Atmospheric Pressure and Low Boiling Point of Water
• Low Altitude: High Atmospheric Pressure and High Boiling Point of Water
Example 2 – Forbes’ Data
• x = boiling point of water (in Fahrenheit)
y = atmospheric pressure (in inches of mercury)
• Residual Plot on the Right: Presence of systematic error
(quadratic relationship?) between x and y
Residual = y – “Fitted Value of y”
• One point is marked as an outlier in the plots
Example 2 – Forbes’ Data
• Data Transformation: y = log(atmospheric pressure)
=> Points in the residual plot fall closer to a horizontal line, apart from one outlier
• General Procedure:
1. Understand the data (via the scatterplot)
2. Fit a linear regression to the data (Ch2-3)
3. Understand the residuals based on Residual Analysis (Ch7) and
make the necessary Data Transformation (Ch8)
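The effect of the log transformation on the residuals can be sketched with simulated data. The values below are a stand-in with a similar range to the boiling point and pressure measurements, not Forbes' actual data, and the helper `curvature` is a made-up name for this example.

```r
# Simulated stand-in for boiling point (x) and pressure (y); not Forbes' data.
set.seed(2)
x <- seq(194, 212, length.out = 17)
y <- exp(-0.97 + 0.0206 * x + rnorm(17, sd = 0.001))

fit_raw <- lm(y ~ x)         # straight line on the original scale
fit_log <- lm(log(y) ~ x)    # straight line after transforming y

# A systematic (curved) pattern shows up as correlation between the
# residuals and the squared, centred x values.
curvature <- function(fit) cor(residuals(fit), (x - mean(x))^2)
c(raw = curvature(fit_raw), log = curvature(fit_log))
```

A large value for the raw fit signals the quadratic pattern in the residuals; after the log transformation the residuals should show much less of it.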
Example 3 – Length at Age for Smallmouth Bass
• Background: Smallmouth Bass (小
嘴鱸魚) is a popular game fish in
North America
• Problem of interest: Avoid
excessive fishing => Would like to
impose fishing regulations to
protect young smallmouth bass
(based on their length)
• Want to study the growth pattern
(age vs length) of fish:
• y = Length of smallmouth bass
at capture (in mm)
• x = Age of smallmouth bass at
capture (in years)
Linear relationship between length
and age
Example 3 – Length at Age for Smallmouth Bass
• Dashed line: Connects the average length of fish in each age group,
i.e. the sample mean length of fish at age i, for i = 1, 2, …, 8
Need 8 numbers to summarize the locations (i.e. 1st moment) of the
data
• Solid line: Regression line
y = a + bx
Only 2 numbers are required to
relate the 8 locations of length
by age
=> Regression provides a
simpler model for the data
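The comparison between 8 group means and a 2-parameter line can be sketched as below. The lengths are simulated for illustration (they are not the textbook's smallmouth bass data), and the sample sizes per age are arbitrary assumptions.

```r
# Simulated lengths (mm) for ages 1 to 8; not the textbook's smallmouth bass data.
set.seed(3)
age       <- rep(1:8, times = c(30, 60, 80, 70, 50, 30, 15, 5))
length_mm <- 65 + 30 * age + rnorm(length(age), sd = 28)

group_means <- tapply(length_mm, age, mean)   # dashed line: 8 sample means
fit <- lm(length_mm ~ age)                    # solid line: 2 parameters (a, b)
fitted_at_ages <- predict(fit, newdata = data.frame(age = 1:8))

# Side-by-side: mean length per age group vs. value on the regression line
round(cbind(group_mean = group_means, regression = fitted_at_ages), 1)
```

When the mean function is close to linear, the two columns agree closely, so the two regression coefficients summarize what would otherwise take eight numbers.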
Example 4 – Predicting the Weather
Problem of Interest: Can early snowfall (Sep 1st to Dec 31st) predict
late snowfall (Jan 1st to Jun 30th next year) at Fort Collins, Colorado?
• Money magazine's Best Place to Live in the
U.S. in 2006
• One of the towns that inspired the design of
Main Street, U.S.A. inside the main entrance
of the many 'Disneyland'-style parks
Example 4 – Predicting the Weather
Problem of Interest: Can early snowfall (Sep 1st to Dec 31st) predict
late snowfall (Jan 1st to Jun 30th next year) at Fort Collins, Colorado?
• x = Early Snowfall (in inches)
from Sep 1st to Dec 31st
• y = Late Snowfall (in inches)
from Jan 1st to Jun 30th next year
• Yearly data from 1900 to 1992
(n=93)
• Dashed line = Fitted regression line
• Solid line = Average late-winter snowfall level (with slope = 0)
• “Can Early Snowfall predict Late Snowfall?”
Hypothesis Testing: Is the slope significantly different from 0?
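A test of whether the slope differs from 0 can be read off the `lm` summary. The sketch below uses simulated snowfall values for 93 years (not the Fort Collins record), generated so that late snowfall does not depend on early snowfall.

```r
# Simulated snowfall (inches) for 93 years; not the Fort Collins record.
set.seed(4)
early <- rgamma(93, shape = 3, scale = 5)
late  <- rgamma(93, shape = 5, scale = 6)   # generated independently of early

fit <- lm(late ~ early)
slope_row <- summary(fit)$coefficients["early", ]
slope_row    # Estimate, Std. Error, t value, Pr(>|t|)

# Two-sided p-value for H0: slope = 0, recomputed from the t statistic
t_stat <- unname(slope_row["t value"])
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
p_val
```

A large p-value means the fitted line is not distinguishable from the horizontal line at the average late snowfall.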
Example 5 – Turkey Growth
• A farmer would like to increase the yield of turkeys (火雞) through
the use of amino acids => How is the weight gain of turkeys affected by
(1) Type of amino acid supplement (A Categorical Variable!!)
(2) Amount of amino acid supplement (% of total diet)

% of amino acid in diet | Amino Acid #1 | Amino Acid #2 | Amino Acid #3
4%                      |               |               |
10%                     |               |               |

• Record the average weight gain of each turkey pen (欄)
Example 5 – Turkey Growth
• y = Average weight gain (in grams) of turkeys in a pen
• x = Dose of Amino Acid Supplement (as a percentage of total diet)
• Circle/Triangle/Cross:
3 different types of
amino acid supplement
in the diet
• Challenges:
• Non-linear relationship between x and y (Ch5 Polynomial
Regression)
• Inclusion of the type of amino acid (a categorical variable) in
the regression model (Ch5 Dummy Variables)
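Both challenges can be handled within `lm`, as the following sketch suggests. The doses, the labels `A1`/`A2`/`A3`, and the weight gains are hypothetical values simulated for illustration, not the turkey data from the textbook.

```r
# Simulated pens; doses, amino-acid labels and weight gains are hypothetical.
set.seed(5)
dose <- rep(c(0.04, 0.10, 0.16, 0.22, 0.28), times = 3)
type <- factor(rep(c("A1", "A2", "A3"), each = 5))
type_effect <- unname(c(A1 = 0, A2 = 15, A3 = 30)[as.character(type)])
gain <- 600 + 1200 * dose - 1800 * dose^2 + type_effect + rnorm(15, sd = 5)

# Quadratic in dose (polynomial regression) plus dummy variables for type
fit <- lm(gain ~ dose + I(dose^2) + type)
coef(fit)
head(model.matrix(fit))   # columns typeA2 and typeA3 are the 0/1 dummies
```

R creates the dummy variables automatically when a factor appears in the model formula; the first level (`A1` here) acts as the baseline.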
Section 1.3
Installation of R and R Libraries
Installation of R
• Datasets from the textbook (and this course) are available in the R
libraries “car” and “alr4”
• Requires R version 3.5.0 or higher (current version: 4.1.1)
Installation of R
1. Go to http://cran.r-project.org/bin/windows/base/ and “Download
R 4.1.1 for Windows”
Mac OS X: R-4.1.1.pkg from https://cran.r-project.org/bin/macosx/
2. Run the .exe file to install R. The default folder of the software is
“C:\program files\R\R-4.1.1\”
Installation of R Libraries (car and alr4)
• Most data sets in this course are stored in the “alr4” library
• Installing the “alr4” library is messy because it depends on many
other libraries, including:
carData, car, curl, rio, data.table, haven, Rcpp, forcats, …
### The following R codes are available in RcodeCh1.r on Blackboard ###
### Make sure you are not using super version of R ###

### Install the alr4 library and the libraries it depends on (would take 5-10 mins!) ###
update.packages(repos="http://cran.rstudio.com/", checkBuilt=TRUE, ask=FALSE)
install.packages(c("carData","car","effects","rio","curl","data.table","haven","Rcpp",
  "forcats","magrittr","hms","rlang","vctrs","zeallot","backports","pkgconfig","tibble",
  "pillar","crayon","openxlsx","alr4"), repos="http://cran.rstudio.com/", dependencies=TRUE)

### Test if the alr4 library works using the Forbes data (2nd example earlier) ###
library(car) # Load the car package
library(alr4) # Load the alr4 package for Forbes data below
Temperature<-Forbes[,1] # Object “Temperature” from the 1st column of “Forbes” data
Pressure<-Forbes[,2] # Object “Pressure” from the 2nd column of “Forbes” data
par(mfrow=c(1,2)) # Set the Graphical screen to 1 row & 2 columns
plot(Temperature,Pressure) # Scatterplot of Temperature (x) and Pressure (y)
fit0<-lm(Pressure~Temperature) # Create linear regression (lm) object (fit0)
abline(fit0) # Draw the regression line
Residuals<-fit0$residuals # Object “Residuals” extracted from the object “fit0”
plot(Temperature,Residuals) # Scatterplot of Temperature (x) and Residuals (y)
abline(h=0,lty=2) # Draw the x-axis using a dotted line (line type = 2)
Installation of R: Step-by-Step
1) In Rx64 (4.1.1) or Rxi386 (4.1.1), File (from Menu Bar) -> New Window =>
A window called the R Editor is created at the bottom right
2) Copy the R codes from the previous page to the R Editor
[Alternative to steps 1) and 2): File (from Menu Bar) -> Open Script and choose
the RcodeCh1.r file you downloaded from the course Blackboard]
3) Select All (Ctrl+A), then Run line or selection (Ctrl+R). The codes will be
executed at the R console on the left
Section 1.4
Mean and Variance Functions
Mean and Variance Functions
• Consider data {(xi, yi), i = 1, 2, … ,n}
• x is called the Explanatory Variable (EV) [also called the
Predictor or Independent Variable]
• y is called the Response Variable (RV) [also called the
Dependent Variable]
• Model assumptions when setting up a regression model:
1. Mean Function E(Y | X = x)
2. Variance Function Var(Y | X = x)
1. Mean Function is the expected value of the response when the
explanatory variable X = x:
E(Y | X = x) = f(x)
• Linear Regression: f(x) = a + bx
• Quadratic Regression: f(x) = a + b1 x + b2 x²
Mean Function – Inheritance of Height
Example: Inheritance of Height (Mother’s height vs Daughter’s height)
[Figure: the line y = x and the fitted regression line y = a + bx]
• Mean Function E(Dheight | Mheight = x) = a + b x ,
where a (intercept), b (slope) are the parameters of the linear regression
• b < 1, with E(Dheight | Mheight = 70) = 68 (Why?)
Mean Function – Turkey Data
• Possible (non-linear)
mean function:
E(Growth| Dose = x)
= β0 + β1 [1-exp(- β2 x)]
• Interpretation of Parameters
• β0: Baseline growth (i.e. growth without amino acid supplement)
• β1: Max. effect of amino acid, with y → β0 + β1 as x → ∞
• β2: The speed to achieve the maximum growth
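A mean function of this form is not linear in the parameters, so `lm` does not apply directly; one option is nonlinear least squares with `nls`. The sketch below uses simulated data in arbitrary dose units, and the "true" parameter values and starting values are assumptions for illustration only.

```r
# Simulated growth data; the "true" parameter values are assumptions.
set.seed(6)
dose   <- rep(seq(0, 4, by = 0.5), each = 5)
growth <- 600 + 200 * (1 - exp(-1.5 * dose)) + rnorm(length(dose), sd = 5)

# Nonlinear least squares for E(Growth | Dose = x) = b0 + b1 * (1 - exp(-b2 * x))
fit <- nls(growth ~ b0 + b1 * (1 - exp(-b2 * dose)),
           start = list(b0 = 590, b1 = 190, b2 = 1))
coef(fit)

# As the dose grows, the fitted mean approaches b0 + b1
unname(coef(fit)["b0"] + coef(fit)["b1"])
```

`nls` needs reasonable starting values; here `b0` can be guessed from the growth at dose 0 and `b0 + b1` from the plateau at high doses.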
Variance Function
• Variance Function is assumed to be constant (mostly unknown)
throughout the course:
Var(Y | X = x) = σ² (constant variability)
That is, the variance of the response
is THE SAME for all values of x
Why constant variance?
Because it leads to good statistical
properties of the estimators
• Example: Heights Data
• Var(Y | X = x) = σ²
• Scatterplot: The variance function is approximately the same
along different values of x
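One rough numerical check of constant variance is to compare the spread of the residuals for small and large x. The sketch below uses simulated data, and `sd_ratio` is a made-up helper for this example rather than a standard diagnostic.

```r
# Simulated data: y1 has constant variance, y2 has variance growing with x.
set.seed(7)
n  <- 2000
x  <- runif(n, min = 1, max = 10)
y1 <- 1 + 2 * x + rnorm(n, sd = 2)          # Var(Y | X = x) = 4 for all x
y2 <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)    # Var(Y | X = x) = 0.25 * x^2

# Ratio of residual standard deviations: upper half of x over lower half
sd_ratio <- function(y) {
  res   <- residuals(lm(y ~ x))
  upper <- x > median(x)
  sd(res[upper]) / sd(res[!upper])
}
c(constant = sd_ratio(y1), increasing = sd_ratio(y2))
```

A ratio near 1 is consistent with constant variance, while a ratio well above 1 indicates that the spread grows with x.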
Reasonable Assumption on Constant Variance?
Section 1.5
Separated Points
Four Hypothetical Data Sets
• [Textbook Table 1.1] 4 different data sets {(xi, yi), i = 1, 2, …, 11}
Same summary statistics: $\{\bar{x}, \bar{y}, s_x^2, s_y^2, s_{xy}\}$
• (Ch2) Estimates of y = a + bx depend on the 5 summary
statistics only => Same regression lines for the 4 data sets
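This can be reproduced in R under one assumption: that the four data sets in Table 1.1 are Anscombe's quartet, which ships with base R as `anscombe` (11 observations per set). The sketch computes the summary statistics and then the least-squares estimates b = s_xy / s_x² and a = ȳ − b x̄ for each set.

```r
# Assumption: the four data sets are Anscombe's quartet (base R `anscombe`).
data(anscombe)

fits <- sapply(1:4, function(k) {
  x <- anscombe[[paste0("x", k)]]
  y <- anscombe[[paste0("y", k)]]
  b <- cov(x, y) / var(x)          # slope     b = s_xy / s_x^2
  a <- mean(y) - b * mean(x)       # intercept a = ybar - b * xbar
  c(xbar = mean(x), ybar = mean(y), sx2 = var(x), sy2 = var(y),
    sxy = cov(x, y), a = a, b = b)
})
colnames(fits) <- paste("Set", 1:4)
round(fits, 2)

# Same estimates as lm(), here for the first data set
coef(lm(y1 ~ x1, data = anscombe))
```

The four columns agree to two decimal places even though the scatterplots of the four sets look very different, which is the point of plotting before fitting.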
Four Hypothetical Data Sets
Conclusions from the above:
1. Dependence is not limited to E(Y|X) = a + bX. (e.g. polynomial)
2. Summary statistics (i.e. a, b from the regression line) may not be a good
summary of dependence
Should first understand the data graphically (e.g. scatterplot)
before fitting a regression line
Separated Points
• Separated Points: Points are well separated from the other points,
either horizontally or vertically
• Does the presence of separated points affect the regression line?
Separated Points
• Horizontal: Leverage point (i.e. leverage effect on the line)
The location of the leverage
point (x, y) has a higher
impact on the regression
line (i.e. leverage) than the
other points
• Vertical: Outlier (i.e. lies outside the line)
An outlier typically does not
affect the regression line
much
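The different impact of the two kinds of separated points can be sketched with simulated data. The two added points below are hypothetical: the vertical one is placed at the mean of x, where it moves the intercept but leaves the slope essentially unchanged, and the horizontal one is placed far to the right and below the trend.

```r
# Simulated data; the added points are hypothetical.
set.seed(8)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)
slope <- function(x, y) unname(coef(lm(y ~ x))[2])

base_slope <- slope(x, y)
# Vertical separation: a point at the centre of x, far above the line
outlier_slope  <- slope(c(x, mean(x)), c(y, 2 + 0.5 * mean(x) + 10))
# Horizontal separation: a point far to the right, well below the line
leverage_slope <- slope(c(x, 40), c(y, 2 + 0.5 * 40 - 10))

c(base = base_slope, with_outlier = outlier_slope,
  with_leverage = leverage_slope)
```

A leverage point that happens to lie on the trend of the other points would change the line very little; the large effect here comes from combining an extreme x with a y away from the trend.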
Section 1.6
Scatterplot
Why Scatterplot?
• A scatterplot uses Cartesian (x-y) coordinates to display the values
of two variables.
• A scatterplot can be used to identify the following:
1. the mean function
2. the variance function
3. separated points
Inheritance of Height Data:
1. Mean function: Linear
2. Variance function: Constant
3. No separated points
Null Plot
• A null plot is a scatterplot with
1. constant mean function (slope = 0)
2. constant variance function
3. no separated points
Example: Snowfall data
Null plot of the residuals => Linear Regression is a reasonable
model for the data
Scatterplot Matrix
What to do if there are more than 2 variables?
Answer: Draw a scatterplot for EACH PAIR of variables => Scatterplot Matrix
• Only marginal relationship between two variables is observed.
• Joint relationship (e.g. Interaction of 3 or more variables)??
Example: Fuel Consumption Data
Problem: Understand how fuel consumption varies over the 50 states in the US,
and understand the effect of the state gasoline tax on fuel consumption.
• Fuel (y) – Gasoline (in thousands of gallons) sold for road use per
person aged 16+
• Tax (x1) – Gasoline state tax rate (cents per gallon)
• Dlic (x2) – 1,000(# of licensed drivers/ population of age 16+) in that state
• Income (x3) – Personal income (in US$1,000)
• logMiles (x4) – Log (Total length of highway [in miles] of that state)
Scatterplot Matrix
Generate Scatterplot Matrix in R – the “pairs” Function
Example: Generate the Scatterplot Matrix for the Fuel data
library(car); library(alr4)
Tax<-fuel2001$Tax # Gasoline state tax rate
Dlic<-fuel2001$Drivers/fuel2001$Pop # No. of Drivers / population over age 16
Income<-fuel2001$Income # Personal Income
logMiles<-log(fuel2001$Miles,2) # Log (base 2) of total length of highway
Fuel<-fuel2001$FuelC/fuel2001$Pop # Amount of gasoline sold per population over age 16
# Bind the 5 objects (by columns) into a matrix of 5 columns
Data<-cbind(Tax,Dlic,Income,logMiles,Fuel)
pairs(Data) # Generate the Scatterplot Matrix of “Data”
Generate a scatterplot: Use the “plot” function
plot(logMiles, Fuel)
Correlation Heatmap
Correlation Heatmap: Graphical illustration of the correlation matrix
• Quick and dirty way to summarize the linear association (i.e.
correlation) between each pair of variables
• Works particularly well for data with A LOT of variables
• Unable to visualize the linearity and possible separated points
install.packages("corrplot")
library(corrplot)
par(mfrow=c(1,1)) # Set the Graphical screen to 1 row & 1 column
M <- cor(Data) # Compute the correlation matrix
corrplot(M, method = "color") # Heatmap of M