LECTURE 3
REGRESSION ANALYSIS
- MULTIPLE REGRESSION
AGENDA
Last class:
Ŷi = 0.326 + 0.1578 Xi. For every $1 increase in taxi fare, what can we expect?
r² = 0.5533. What does it say about our model?
H0: β1 = 0. The p-value is very, very close to 0, which implies…
Basic Concepts of Multiple Linear Regression
Using Categorical (Dummy) Variables
Measures of Variation and Statistical Inference
FORMULATION OF MULTIPLE REGRESSION MODEL
A multiple regression model relates one dependent variable to two or more independent variables through a linear function:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + \varepsilon_i$

where Yi is the dependent variable, X1i, …, XKi are the independent variables, β0 is the population intercept, β1, …, βK are the population slope coefficients, and εi is the random error.
K is the number of independent variables (e.g., K = 1 for simple linear regression)
β0, β1, β2, …, βK are the K+1 parameters in a multiple regression model with K independent variables
b0, b1, b2, …, bK are used to represent the sample intercept and sample slope coefficients
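As a rough sketch of how the sample coefficients b0, …, bK can be estimated outside Excel, here is a least-squares fit with Python's statsmodels; the column names and data values are illustrative, loosely taken from the taxi example below:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "pretip_fare": [8.30, 15.30, 7.80, 52.80],   # X1
    "area_id":     [1, 1, 1, 0],                 # X2 (a 0/1 dummy, discussed later)
    "tips":        [1.65, 1.00, 1.25, 5.00],     # Y
})

X = sm.add_constant(df[["pretip_fare", "area_id"]])  # adds the intercept column
fit = sm.OLS(df["tips"], X).fit()                    # least-squares estimates

print(fit.params)     # b0 (const), b1, b2
print(fit.rsquared)   # r-squared
```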
MULTIPLE REGRESSION, 2 EXPLANATORY VARIABLES
Say we have n data points, or n observations
Our observations are in the form $(X_{11}, X_{21}, Y_1), (X_{12}, X_{22}, Y_2), \dots, (X_{1n}, X_{2n}, Y_n)$
Observation # | Taxi – Pre-tip fare (X1i) | RatecodeID, 1 = NYC, 2 = JFK (X2i) | Taxi – Tips (Yi) | (X1i, X2i, Yi)
#1            | 8.30                      | 1                                  | 1.65             | (8.30, 1, 1.65)
#2            | 15.30                     | 1                                  | 1.00             | (15.30, 1, 1.00)
#3            | 7.80                      | 1                                  | 1.25             | (7.80, 1, 1.25)
#27           | 52.80                     | 2                                  | 5.00             | (52.80, 2, 5.00)
Source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
TLC Trip Record Data: January 2019 Yellow Taxi Trip Records, published by the NYC Taxi & Limousine Commission
We will need to “fix” the RatecodeID column later…
FORMULATION OF MULTIPLE REGRESSION MODEL
Coefficients in a multiple regression net out the impact of each independent variable in the regression equation.
The estimated slope coefficient bj measures the change in the average value of Y as a result of a one-unit increase in Xj, holding all other independent variables constant (the “ceteris paribus” effect):
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_j X_j + \cdots + b_K X_K$
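A tiny numeric check of the ceteris paribus reading, with made-up coefficients (not the lecture's estimates): raising Xj by one unit while holding every other X fixed moves the prediction by exactly bj.

```python
import numpy as np

b = np.array([0.5, 0.15, -0.9])   # hypothetical b0, b1, b2
x = np.array([1.0, 10.0, 1.0])    # leading 1 multiplies the intercept b0
x_bumped = x.copy()
x_bumped[1] += 1.0                # one-unit increase in X1; X2 held constant

print(b @ x_bumped - b @ x)       # 0.15, i.e. exactly b1
```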
EXAMPLE – USING CATEGORICAL (DUMMY) VARIABLES
Last time, we did a simple linear regression on taxi fare and tips.
We want to see if the location also affects the tip.
Column E (RatecodeID) has 2 possibilities: 1= New York City, 2 = JFK Airport
Can we use column E as-is? Consider two trips from NYC and JFK, both with
fares of $10.
Observation #i | Taxi – Pre-tip fare (X1i) | RatecodeID, 1 = NYC, 2 = JFK (X2i) | What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i
e.g. 1         | 10.00                     | 1                                  | Ŷ1 = b0 + 10 b1 + b2
e.g. 2         | 10.00                     | 2                                  | Ŷ2 = b0 + 10 b1 + 2 b2

b2 vs 2 b2? Double the bonus?
USING CATEGORICAL (DUMMY) VARIABLES
Let’s define a new column: AreaID. We are “inside” the area if we are in NYC,
“outside” the area if we are NOT in NYC (i.e. JFK, etc).
We can pre-process the data so that 𝑋2𝑖 = 1 if we are inside NYC and 𝑋2𝑖 = 0
if we are outside NYC
Observation #i | Taxi – Pre-tip fare (X1i) | AreaID, 1 = NYC, 0 = JFK (X2i) | What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i
e.g. 1         | 10.00                     | 1                              | Ŷ1 = b0 + 10 b1 + b2
e.g. 2         | 10.00                     | 0                              | Ŷ2 = b0 + 10 b1
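The pre-processing step could look like this in pandas; a minimal sketch where the raw column name RatecodeID follows the TLC data and the new name AreaID is our own choice:

```python
import pandas as pd

df = pd.DataFrame({"RatecodeID": [1, 1, 1, 2]})   # 1 = NYC, 2 = JFK

# AreaID = 1 if the trip starts inside NYC, 0 if outside (JFK, etc.)
df["AreaID"] = (df["RatecodeID"] == 1).astype(int)
print(df)
```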
USING CATEGORICAL (DUMMY) VARIABLES
𝑋2𝑖 = 1 if we are inside NYC and 𝑋2𝑖 = 0 if we are outside NYC
Interpretation:
If 𝑏2 > 0: Everything else remaining constant, we expect to receive a bonus tip of
$|𝑏2 | when we pick up a passenger in NYC
If 𝑏2 < 0: Everything else remaining constant, we expect our tip to reduce by $|𝑏2 |
when we pick up a passenger in NYC.
This variable incorporates a fixed tip amount for NYC vs non-NYC trips, NOT a change in the tip percentage!
USING CATEGORICAL (DUMMY) VARIABLES
Useful when an explanatory variable isn’t numerical (e.g. colours, locations)
Use 0, 1 variables: 0 = “is not, does not fit definition”, 1 = “is, fits definition”
If a category has 𝑐 choices, then we need 𝑐 − 1 categorical variables
E.g. product design: a product can be red, yellow, or blue. We want to see how colour affects popularity. In a regression model, we need 2 categorical variables:
X1 = 1 if it is red, and 0 otherwise
X2 = 1 if it is yellow, and 0 otherwise
(Blue is the baseline case captured by the intercept, as in e.g. 3 below.)
Obs #i          | Red? (X1i) | Yellow? (X2i) | What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i + ⋯
e.g. 1 (Red)    | 1          | 0             | Ŷ1 = b0 + b1 + ⋯
e.g. 2 (Yellow) | 0          | 1             | Ŷ2 = b0 + b2 + ⋯
e.g. 3 (Blue)   | 0          | 0             | Ŷ3 = b0 + ⋯
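In pandas, the c − 1 dummies can be generated automatically; a sketch with the colour example, where drop_first=True drops the alphabetically first category ("blue"), which becomes the baseline absorbed into b0:

```python
import pandas as pd

products = pd.DataFrame({"colour": ["red", "yellow", "blue"]})

# 3 choices -> keep 3 - 1 = 2 dummy columns
dummies = pd.get_dummies(products["colour"], drop_first=True, dtype=int)
print(dummies)   # columns "red" and "yellow"; blue rows are (0, 0)
```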
BUILDING THE MODEL
After fixing the categorical variable for AreaID, we can fill in the regression
window.
MODEL OUTPUT
Excel’s output:
$\hat{Y} = 1.3771 + 0.1488 X_1 - 0.9521 X_2$
*Scientific notation: 1.7284E−226 = 1.7284 × 10⁻²²⁶ ≈ 0
INTERPRETATION OF ESTIMATES
The estimated multiple regression equation:
$\hat{Y} = 1.3771 + 0.1488 X_1 - 0.9521 X_2$
Ŷ = estimated taxi tips in NYC, in $
X1 = pre-tip amount in $
X2 = area indicator (NYC = 1, non-NYC (JFK) = 0)
Interpretation of the estimated slope coefficients:
b1 = 0.1488 says that the estimated average tips increase by $0.1488 for each $1 increase in pre-tip taxi fare, given that the other independent variables remain constant
b2 = −0.9521 says that the estimated average tips decrease by $0.9521 when the trip starts in NYC instead of JFK, given that the other independent variables remain constant
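To make the fixed-amount reading concrete, compare two $10 trips: Ŷ_NYC = 1.3771 + 0.1488(10) − 0.9521(1) = 1.9130, while Ŷ_JFK = 1.3771 + 0.1488(10) − 0.9521(0) = 2.8651. The gap is exactly b2 = −0.9521 regardless of the fare: a fixed dollar amount, not a percentage.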
COMPARISON OF MODELS
Suppose we add more explanatory variables:
X1 = pre-tip amount in $
X2 = area indicator (NYC = 1, non-NYC (JFK) = 0)
X3 = # of riders
X4 = New Year's Day indicator (Jan 1 = 1, otherwise = 0)

$\hat{Y} = 1.3181 + 0.1485 X_1 - 0.9501 X_2 + 0.0404 X_3 + 0.0503 X_4$
INTERPRETATION OF ESTIMATES
Multiple regression model:
$\hat{Y} = 1.3181 + 0.1485 X_1 - 0.9501 X_2 + 0.0404 X_3 + 0.0503 X_4$
The estimated slope coefficients:
b1 = 0.1485 says that the estimated average tips increase by $0.1485 for each $1 increase in pre-tip taxi fare, holding all other things equal
b2 = −0.9501 says that the estimated average tips decrease by $0.9501 when the trip starts in NYC instead of non-NYC (JFK), holding all other things equal
b3 = 0.0404 says that the estimated average tips increase by $0.0404 for each additional rider, holding all other things equal
b4 = 0.0503 says that the estimated average tips increase by $0.0503 if the trip is on New Year's Day, holding all other things equal
EVALUATE THE MODEL
r² and adjusted r²
F-test for overall model significance
t-test for the significance of a particular X-variable
MEASURES OF VARIATION - r²
$\hat{Y} = 1.3181 + 0.1485 X_1 - 0.9501 X_2 + 0.0404 X_3 + 0.0503 X_4$
Total variation of the Y-variable is made up of two parts:
$SST = SSR + SSE$
where
$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ (total sum of squares)
$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ (regression sum of squares; the part explained by the X-variables: pre-tip fare, area, # of passengers, New Year's Day)
$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ (error sum of squares)
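A small sketch of the decomposition with toy data (not the full taxi sample); for a least-squares fit with an intercept, the identity SST = SSR + SSE holds exactly:

```python
import numpy as np

x = np.array([8.30, 15.30, 7.80, 52.80])
y = np.array([1.65, 1.00, 1.25, 5.00])

X = np.column_stack([np.ones_like(x), x])    # intercept column + X1
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares estimates
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)            # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)        # explained by the regression
sse = np.sum((y - y_hat) ** 2)               # error (residual) variation

print(np.isclose(sst, ssr + sse))            # True
print("r2 =", ssr / sst)
```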
MEASURES OF VARIATION - r²
We can ALWAYS increase r² by adding variables, even ones that don't explain the changes in Y
This is easier to see with less data: see the “r-squared comparison” tab in the spreadsheet
We add one more column of 0/1s (1 = odd-numbered row, 0 = even-numbered row) and compare the two regression outputs
MEASURES OF VARIATION - r²
What is the net effect of adding a new X-variable?
r² increases, even if the new X-variable explains an insignificant proportion of the variation of the Y-variable
Is it fair to use r² for comparing models with different numbers of X-variables?
A degree of freedom* is lost, as a slope coefficient has to be estimated for the new X-variable
Did the new X-variable add enough explanatory power to offset the loss of one degree of freedom?
Degrees of freedom of the residual = n − (K + 1) = n − 1 − K
*Degrees of freedom: the number of independent pieces of information (data values) in the random sample.
If K + 1 parameters (the coefficients b0, b1, …, bK) must be estimated before the sum of squared errors, SSE, can be calculated from a sample of size n, the degrees of freedom equal n − (K + 1).
MEASURES OF VARIATION – ADJUSTED r²
(Recall: $r^2 = 1 - \frac{SSE}{SST}$)

Adjusted $r^2 = 1 - \frac{SSE/(n-K-1)}{SST/(n-1)} = 1 - \frac{n-1}{n-K-1}(1 - r^2)$
Measures the proportion of variation of the Y values that is explained by the regression equation with the independent variables X1, X2, …, XK, after adjusting for the sample size (n) and the number of X-variables used (K)
Smaller than or equal to r², and can be negative
Penalizes the excessive use of X-variables
Useful for comparing models with different numbers of X-variables
EXAMPLE – ADJUSTED r²
Compare the models that we've built.
Number of observations: 197,103. SST: 1,163,798.

                              | 1 explanatory variable (pre-tip fare) | 2 explanatory variables (pre-tip fare, area ID) | 4 variables
Degrees of freedom (residual) | 197,101                               | 197,100                                         | 197,098
SSE                           | 519,852                               | 517,136                                         | 516,911
r²                            | 0.553314                              | 0.555647                                        | 0.555841
Adjusted r²                   | 0.553312                              | 0.555643                                        | 0.555832
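The table can be reproduced from n, SST, and SSE alone; a quick check in Python (the values match up to rounding):

```python
n, sst = 197_103, 1_163_798

for k, sse in [(1, 519_852), (2, 517_136), (4, 516_911)]:
    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    print(f"K={k}: r2={r2:.6f}, adjusted r2={adj_r2:.6f}")
```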
INFERENCE: OVERALL MODEL SIGNIFICANCE
Is the model significant? Do we need a model?
F-test
OVERALL MODEL SIGNIFICANCE: F-TEST
F-test for the overall model significance
Null hypothesis H0: β1 = β2 = ⋯ = βK = 0 (none of the X-variables affects Y)
Alternative hypothesis H1: at least one βi ≠ 0 (at least one X-variable affects Y)
We want to REJECT the null hypothesis by showing that the probability of seeing our values of b1, b2, …, bK is “low” if H0 were indeed true.
F-statistic:
$F = \frac{MSR}{MSE} = \frac{SSR/K}{SSE/(n-K-1)}$ with K, (n − 1 − K) degrees of freedom (d.f.), where MSR is the mean square for SSR and MSE is the mean square for SSE
OVERALL MODEL SIGNIFICANCE: F-TEST
$F = \frac{MSR}{MSE} = \frac{SSR/K}{SSE/(n-K-1)}$ with K, (n − 1 − K) degrees of freedom (d.f.)
First decide on the size of the rejection region α (one tail), the level of significance
Method 1 (with F-table): rejection region approach
Reject H0 if F > critical value (C.V.) = F_{α,K,(n−K−1)}
Method 2 (with Excel output): p-value approach
p-value = P(F_{K,(n−K−1)} ≥ F)
Reject H0 if p-value < α
OVERALL MODEL SIGNIFICANCE: F-TEST
[Figure: probability distribution of F with the upper-tail rejection region shaded] Suppose α = 0.05. The critical value is C.V. = F_{α,K,(n−K−1)} = 2.37, while the F-statistic calculated from the sample data is F = 61,664. The p-value = P(F ≥ 61,664) ≈ 0 < 5%, so at the 5% significance level H0 is rejected.
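A sketch of the same F-test in Python with scipy, plugging in the n, K, SST, and SSE figures from the comparison table; the slide's F = 61,664 and C.V. = 2.37 reappear up to rounding:

```python
from scipy import stats

n, K = 197_103, 4
sst, sse = 1_163_798, 516_911
ssr = sst - sse

F = (ssr / K) / (sse / (n - K - 1))      # about 61,664
cv = stats.f.ppf(0.95, K, n - K - 1)     # about 2.37 for alpha = 0.05
p_value = stats.f.sf(F, K, n - K - 1)    # upper-tail area, effectively 0

print(F, cv, p_value)                    # p-value < alpha -> reject H0
```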
SIGNIFICANCE OF A PARTICULAR X-VARIABLE: T-TEST
Even if we reject H0 in our F-test, we cannot distinguish which X-variable(s) have a significant impact on the Y-variable
t-test for a particular X-variable's significance:
Null H0: βi = 0 (Xi has no linear relationship with Y, given the presence of the other X-variable(s))
Alternative H1: βi ≠ 0 (Xi is linearly related to Y, given the presence of the other X-variable(s))
SIGNIFICANCE OF A PARTICULAR X-VARIABLE: T-TEST
Null H0: β1 = 0
Method 1: rejection region approach
Reject H0 if |T| > C.V. = t_{α/2,(n−K−1)}
Method 2: p-value approach
p-value = P(|T| ≥ |t|)
Reject H0 if p-value < α
[Figure: Student’s t-distribution centred at β1 = 0, with rejection area α/2 in each tail] If α = 5%, the critical values are ±t_{0.025,(n−5)} ≈ ±1.96. The t-statistic calculated from the sample (the coefficient estimate divided by its standard error) is t = 348.81, far beyond the critical value.
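The same two-tailed t-test by hand in Python, taking the slide's t = 348.81 for the pre-tip fare coefficient as given:

```python
from scipy import stats

n, K = 197_103, 4
dof = n - K - 1

t = 348.81                               # coefficient estimate / standard error
cv = stats.t.ppf(0.975, dof)             # about 1.96 for alpha = 5%
p_value = 2 * stats.t.sf(abs(t), dof)    # both tails

print(cv, p_value)                       # |t| > C.V., p-value ~ 0 -> reject H0
```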
EXAMPLE
Conclusion: the p-value is smaller than 5%, so reject H0. The pre-tip fare is significantly related to the tips, given the presence of the other X-variables.
What about the other variables?
According to the t-test results, the p-value for each of the four explanatory variables is smaller than 5%.
This indicates that each explanatory variable is significantly related to tips paid in NYC, given the presence of the other X-variables.
*Scientific notation: 6.41657E−08 = 6.41657 × 10⁻⁸ = 0.0000000642 ≈ 0
EXAMPLE
What does the table look like if there is an insignificant explanatory variable?
We added a fifth variable to label rows as “odd” or “even” (see the “5var – odd/even” tab).
The p-value for “Odd/Even transaction” is LARGER than 5%, so we cannot reject H0. This indicates that the odd/even indicator is not significantly related to tips paid in NYC, given the presence of the other X-variables.
VARIABLES SELECTION STRATEGIES
Some of the independent variables may be insignificant based on t-test results
We may consider eliminating insignificant independent variables using the following methods:
All possible regressions
Backward elimination
Forward selection
Stepwise regression
ALL POSSIBLE REGRESSIONS
Develop all possible regression models between the dependent variable and all possible combinations of independent variables
If there are K X-variables to consider, there are 2^K − 1 possible regression models to develop (see the sketch below)
The criteria for selecting the best model may include:
Mean square error (MSE)
Adjusted r²
Disadvantages of all possible regressions:
No unique conclusion: different criteria lead to different conclusions
Looks at overall model performance, but not individual variable significance
When there is a large number of potential X-variables, the computational time can be long
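A brute-force sketch of the search, assuming statsmodels and a DataFrame df; the function and column arguments are our own, not a library routine. It fits all 2^K − 1 subsets and ranks them by adjusted r², one of the criteria above:

```python
from itertools import combinations
import statsmodels.api as sm

def all_possible_regressions(df, y_col, x_cols):
    """Fit every non-empty subset of x_cols; best adjusted r-squared first."""
    fits = []
    for r in range(1, len(x_cols) + 1):
        for subset in combinations(x_cols, r):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[y_col], X).fit()
            fits.append((fit.rsquared_adj, subset))
    return sorted(fits, reverse=True)
```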
BACKWARD ELIMINATION
Evaluate individual variable significance (a sketch follows below)
Step 1: Build a model using all potential X-variables
Step 2: Identify the least significant X-variable using the t-test
Step 3: Remove this X-variable if its p-value is larger than the specified level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model after removing this X-variable; repeat steps 2 and 3 until all remaining X-variables are significant
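A minimal sketch of backward elimination with statsmodels; names are illustrative:

```python
import statsmodels.api as sm

def backward_elimination(df, y_col, x_cols, alpha=0.05):
    """Drop the least significant X-variable until all p-values < alpha."""
    kept = list(x_cols)
    while kept:
        fit = sm.OLS(df[y_col], sm.add_constant(df[kept])).fit()
        p_values = fit.pvalues.drop("const")   # t-test p-value per X-variable
        worst = p_values.idxmax()              # least significant X-variable
        if p_values[worst] <= alpha:           # all significant: terminate
            return fit
        kept.remove(worst)                     # remove it, then refit
    return None                                # no X-variable survived
```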
FORWARD SELECTION
Evaluate individual variable significance (a sketch follows below)
Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant candidate X-variable using the t-test
Step 3: Add this X-variable if its p-value is smaller than the specified level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model after including this X-variable; repeat steps 2 and 3 until all significant X-variables are entered
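A matching sketch of forward selection, again with statsmodels and illustrative names:

```python
import statsmodels.api as sm

def forward_selection(df, y_col, x_cols, alpha=0.05):
    """Add the most significant candidate X-variable until none qualifies."""
    kept, remaining = [], list(x_cols)
    while remaining:
        # p-value each candidate would earn if added to the current model
        trials = {
            x: sm.OLS(df[y_col], sm.add_constant(df[kept + [x]])).fit().pvalues[x]
            for x in remaining
        }
        best = min(trials, key=trials.get)
        if trials[best] >= alpha:   # no significant candidate: terminate
            break
        kept.append(best)
        remaining.remove(best)
    return kept
```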
STEPWISE REGRESSION
Evaluate individual variable significance
An X-variable that has entered can later leave; an X-variable that was eliminated can later re-enter
Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant candidate X-variable; add it if its p-value is smaller than the specified level of significance; otherwise terminate the procedure
Step 3: Identify the least significant X-variable in the model; remove it if its p-value is larger than the specified level of significance
Step 4: Repeat steps 2 and 3 until all significant X-variables are entered and none of them has to be removed
PRINCIPLE OF MODEL BUILDING
A good model should
Have few independent variables
Have high predictive power
Have low correlation between independent variables
Be easy to interpret