LECTURE 3
REGRESSION ANALYSIS
- MULTIPLE REGRESSION
AGENDA
Basic Concepts of Multiple Linear Regression
Data Analysis Using Multiple Regression Models
Measures of Variation and Statistical Inference
FORMULATION OF MULTIPLE REGRESSION MODEL
A multiple regression model relates one dependent variable to two or
more independent variables in a linear function
𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ + ⋯ + 𝛽_K 𝑋_K + 𝜀
where 𝑌 is the dependent variable, 𝑋₁, …, 𝑋_K are the independent variables,
𝛽₀ is the population intercept, 𝛽₁, …, 𝛽_K are the population slope
coefficients, and 𝜀 is the random error
𝑏₀, 𝑏₁, 𝑏₂, …, 𝑏_K are used to represent the sample intercept and sample slope
coefficients
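The lecture estimates this model in Excel, but the least-squares fit can be sketched directly. A minimal stdlib-only illustration, fitting 𝑌 = 𝑏₀ + 𝑏₁𝑋₁ + 𝑏₂𝑋₂ via the normal equations (𝑋ᵀ𝑋)𝑏 = 𝑋ᵀ𝑌; the data and the function name `fit_ols` are made up for illustration, and the data follow an exact linear relation so the fit recovers the coefficients:

```python
# Minimal least-squares fit of Y = b0 + b1*X1 + b2*X2 via the normal
# equations (X'X)b = X'Y, using only the standard library.
def fit_ols(rows, y):
    X = [[1.0] + list(r) for r in rows]            # prepend intercept column
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(len(X))) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(p)]
    # Gaussian elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(XtX[r][c]))
        XtX[c], XtX[piv] = XtX[piv], XtX[c]
        Xty[c], Xty[piv] = Xty[piv], Xty[c]
        for r in range(c + 1, p):
            f = XtX[r][c] / XtX[c][c]
            for k in range(c, p):
                XtX[r][k] -= f * XtX[c][k]
            Xty[r] -= f * Xty[c]
    # Back substitution
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (Xty[r] - sum(XtX[r][k] * b[k] for k in range(r + 1, p))) / XtX[r][r]
    return b

# Synthetic data generated from Y = 0.5 + 2.0*X1 + 1.5*X2 (no noise),
# so the estimated coefficients should match exactly.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 2)]
y = [0.5 + 2.0 * x1 + 1.5 * x2 for x1, x2 in rows]
b0, b1, b2 = fit_ols(rows, y)
```

In practice one would use Excel's Regression tool (as in this lecture) or a statistics library rather than hand-rolling the linear algebra; the sketch only shows what the estimation computes.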
FORMULATION OF MULTIPLE REGRESSION MODEL
Coefficients in a multiple regression net out the impact of each independent
variable in the regression equation
The estimated slope coefficient, 𝑏ⱼ, measures the change in the average value of
𝑌 as a result of a one-unit increase in 𝑋ⱼ, holding all other independent variables
constant – the “ceteris paribus” effect
EXAMPLE
Recall the example in the last topic: we wish to find possible factors
affecting taxi tips in NYC. The relationship between the taxi fare and the size of
the tip was estimated using a 2-variable regression model.
Today we wish to include more factors that could possibly affect tips:
Area
Number of riders
Holiday seasons
……
MULTIPLE LINEAR REGRESSION
Fill in the pop-up box:
MULTIPLE LINEAR REGRESSION
Excel’s Output:
MULTIPLE LINEAR REGRESSION
The estimated multiple regression equation:
𝑌̂ = 1.2649 − 0.9216𝑋₁ + 0.0382𝑋₂ + 17.2675𝑋₃ + 0.0288𝑋₄ + 0.1496𝑋₅
where 𝑌̂ = Taxi tips in NYC in $
𝑋₁ = Area indicator (NYC = 1, JFK = 0)
𝑋₂ = Number of riders
𝑋₃ = High-tipper indicator (High = 1, Normal = 0)
𝑋₄ = New Year's Day indicator (Jan 1st = 1, Others = 0)
𝑋₅ = Pre-tip amount in $
INTERPRETATION OF ESTIMATES
The estimated slope coefficients:
𝑏₁ = −0.9216 says that the estimated average tip decreases by $0.9216 when the trip
area switches from JFK to NYC, holding all other things equal
𝑏₂ = 0.0382 says that the estimated average tip increases by $0.0382 for each
additional rider, holding all other things equal
𝑏₃ = 17.2675 says that the estimated average tip increases by $17.2675 if the rider is
categorized as a high tipper, holding all other things equal
𝑏₄ = 0.0288 says that the estimated average tip increases by $0.0288 if it is New
Year's Day, holding all other things equal
𝑏₅ = 0.1496 says that the estimated average tip increases by $0.1496 for each $1
increase in the pre-tip taxi fare, holding all other things equal
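Plugging a concrete trip into the estimated equation makes the interpretation tangible. The coefficients below are the slide's estimates; the trip scenario (a 2-rider NYC trip, normal tipper, not New Year's Day, $40 pre-tip fare) is made up for illustration:

```python
# Predicted tip from the estimated equation
# Yhat = 1.2649 - 0.9216*X1 + 0.0382*X2 + 17.2675*X3 + 0.0288*X4 + 0.1496*X5
def predicted_tip(area, riders, high_tipper, new_year, pre_tip):
    return (1.2649 - 0.9216 * area + 0.0382 * riders
            + 17.2675 * high_tipper + 0.0288 * new_year
            + 0.1496 * pre_tip)

# Hypothetical trip: NYC area (X1=1), 2 riders, normal tipper,
# an ordinary day, $40 pre-tip fare
tip = predicted_tip(area=1, riders=2, high_tipper=0, new_year=0, pre_tip=40)
# 1.2649 - 0.9216 + 0.0764 + 5.984 = about $6.40
```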
COMPARISON OF MODELS
Suppose we run another linear regression model using only the pre-tip taxi fare
and the number of riders as independent variables
EVALUATE THE MODEL
𝑟² and adjusted 𝑟²
F-test for overall model significance
t-test for the significance of a particular 𝑋-variable
MEASURES OF VARIATION
Total variation of the 𝑌-variable is made up of two parts: SST = SSR + SSE
where
SST = ∑(𝑌ᵢ − 𝑌̄)²  (total sum of squares)
SSR = ∑(𝑌̂ᵢ − 𝑌̄)²  (regression sum of squares)
SSE = ∑(𝑌ᵢ − 𝑌̂ᵢ)²  (error sum of squares)
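The decomposition SST = SSR + SSE holds exactly for any least-squares fit that includes an intercept. A quick numerical check on a small simple-regression example (the data are made up):

```python
# Verify SST = SSR + SSE on a small simple-regression example
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope and intercept (simple regression closed form)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)                 # total variation
SSR = sum((yh - ybar) ** 2 for yh in yhat)              # explained variation
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # unexplained variation
# SST equals SSR + SSE (up to floating-point rounding)
```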
MEASURES OF VARIATION
[Venn diagram of Tips, Taxi-fare and # of riders]
The blue part: SSE, the variation in Tips attributable to factors other than
taxi fare and # of riders
The grey, orange and purple parts: SSR, the total variation of the 𝑌-variable that is
explained by the regression equation with the independent variables
MEASURES OF VARIATION
What is the net effect of adding a new 𝑋-variable?
𝑟² increases, even if the new 𝑋-variable explains an insignificant proportion of the
variation of the 𝑌-variable
Is it fair to use 𝑟² for comparing models with different numbers of 𝑋-variables?
A degree of freedom is lost, as a slope coefficient has to be estimated for the
new 𝑋-variable
Did the new 𝑋-variable add enough explanatory power to offset the loss of one degree of
freedom?
MEASURES OF VARIATION – ADJUSTED 𝑟²
Adjusted 𝑟² = 1 − (1 − 𝑟²) × (𝑛 − 1)⁄(𝑛 − 𝐾 − 1)
Measures the proportion of variation of the 𝑌 values that is explained by the
regression equation with the independent variables 𝑋₁, 𝑋₂, …, 𝑋_K, after
adjustment for the sample size (𝑛) and the number of 𝑋-variables used (𝐾)
Smaller than 𝑟², and can be negative
Penalizes the excessive use of 𝑋-variables
Useful for comparing models with different numbers of 𝑋-variables
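The adjusted-𝑟² formula can be checked against the taxi-tip comparison on the next slide (𝑛 = 197,103; 𝐾 = 1 for the pre-tip-only model with 𝑟² = 0.5533, 𝐾 = 5 for the full model with 𝑟² = 0.6075). With a sample this large the adjustment barely moves 𝑟²:

```python
# Adjusted r^2 = 1 - (1 - r^2) * (n - 1) / (n - K - 1)
def adjusted_r2(r2, n, K):
    return 1 - (1 - r2) * (n - 1) / (n - K - 1)

one_var  = adjusted_r2(0.5533, 197_103, 1)   # pre-tip-only model
five_var = adjusted_r2(0.6075, 197_103, 5)   # 5-variable model
# Both round to the unadjusted values (0.5533 and 0.6075): with n near
# 200,000, losing a handful of degrees of freedom costs almost nothing.
```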
EXAMPLE
Compare the model using only the pre-tip amount against the model using 5
independent variables: which one fits better?
Number of observations: 197,103 vs 197,103
Degrees of freedom (n − K − 1): 197,101 vs 197,097
𝑟²: 0.5533 vs 0.6075
Adjusted 𝑟²: 0.5533 vs 0.6075
The 5-variable model fits better, since its adjusted 𝑟² is higher (with such a
large sample, the degrees-of-freedom adjustment barely changes 𝑟²)
INFERENCE: OVERALL MODEL SIGNIFICANCE
F-test for the overall model significance
𝐻₀: 𝛽₁ = 𝛽₂ = ⋯ = 𝛽_K = 0
(none of the 𝑋-variables affects 𝑌)
𝐻₁: At least one 𝛽ⱼ ≠ 0
(at least one 𝑋-variable affects 𝑌)
F = (SSR⁄𝐾) ⁄ (SSE⁄(𝑛 − 𝐾 − 1)) with (𝐾, 𝑛 − 𝐾 − 1) d.f.
p-value = 𝑃(𝐹_{𝐾, 𝑛−𝐾−1} > F)
Reject 𝐻₀ if F > C.V. 𝐹_{𝛼, 𝐾, 𝑛−𝐾−1}, or p-value < 𝛼
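Since 𝑟² = SSR/SST, the F statistic can be computed directly from 𝑟² as (𝑟²⁄𝐾) ⁄ ((1 − 𝑟²)⁄(𝑛 − 𝐾 − 1)). Applying this to the 5-variable tip model (𝑟² = 0.6075, 𝑛 = 197,103, 𝐾 = 5) roughly reproduces the F value reported later in the slides; the small gap comes from 𝑟² being rounded to four decimals:

```python
# F = (SSR/K) / (SSE/(n-K-1)) rewritten in terms of r^2 = SSR/SST
def f_statistic(r2, n, K):
    return (r2 / K) / ((1 - r2) / (n - K - 1))

F = f_statistic(0.6075, 197_103, 5)
# About 6.1e4, matching the slides' F = 61015.62 up to r^2 rounding;
# far beyond any 5% critical value, so H0 is clearly rejected.
```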
INFERENCE: A PARTICULAR X-VARIABLE'S
SIGNIFICANCE
By rejecting the 𝐻₀ in the F-test, we still cannot distinguish which 𝑋-variable(s)
have significant impacts on the 𝑌-variable
t-test for a particular 𝑋-variable's significance
𝐻₀: 𝛽ⱼ = 0 (𝑋ⱼ has no linear relationship with 𝑌)
𝐻₁: 𝛽ⱼ ≠ 0 (𝑋ⱼ is linearly related to 𝑌)
t = 𝑏ⱼ ⁄ 𝑆_{𝑏ⱼ} with 𝑛 − 𝐾 − 1 d.f.
p-value = 𝑃(|𝑡_{𝑛−𝐾−1}| > |t|)
Reject 𝐻₀ if |t| > C.V. 𝑡_{𝛼⁄2, 𝑛−𝐾−1}, or p-value < 𝛼
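The t statistic and its two-sided p-value can be sketched as follows. With the taxi example's huge degrees of freedom (𝑛 − 𝐾 − 1 ≈ 197,097) the t distribution is essentially standard normal, so a normal approximation via `math.erf` is adequate; the coefficient/standard-error pairs below are hypothetical, not taken from the lecture's Excel output:

```python
import math

# Two-sided p-value for H0: beta_j = 0, using a normal approximation
# to the t distribution (valid when degrees of freedom are very large).
def two_sided_p(b, se_b):
    t = b / se_b                                   # t statistic: b_j / S_{b_j}
    phi = 0.5 * (1 + math.erf(abs(t) / math.sqrt(2)))   # standard normal CDF
    return 2 * (1 - phi)

p_significant   = two_sided_p(0.1496, 0.0012)  # hypothetical: tiny SE -> huge |t|
p_insignificant = two_sided_p(0.0288, 0.05)    # hypothetical: |t| < 1 -> large p
```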
EXAMPLE
For the model using 5 independent variables, is the overall model significant?
F = 61015.62, p-value < 0.00001
At the 5% significance level, 𝐻₀ is rejected
EXAMPLE
At the 5% level of significance, which 𝑋-variable(s) significantly affect 𝑌?
According to the t-test results, the p-values for all five independent variables are
smaller than 5%, indicating that all of them are significantly related to tips paid in
NYC.
VARIABLES SELECTION STRATEGIES
In case some of the independent variables are insignificant based on the t-test
results, one may consider eliminating them using the following methods:
All possible regressions
Backward elimination
Forward selection
Stepwise regression
ALL POSSIBLE REGRESSIONS
To develop all the possible regression models between the dependent variable
and all possible combinations of the independent variables
If there are 𝐾 𝑋-variables to consider using, there are 2^𝐾 − 1 possible
regression models to be developed
The criteria for selecting the best model may include
MSE
Adjusted 𝑟²
Disadvantages of all possible regressions
No unique conclusion: with different criteria, different conclusions will arise
Looks at overall model performance, but not individual variable significance
When there is a large number of potential 𝑋-variables, computational time can be long
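The 2^𝐾 − 1 count is just the number of non-empty subsets of the candidate 𝑋-variables. Enumerating them for the 5 variables of the taxi example (variable names as in the slides) gives 31 candidate models:

```python
from itertools import combinations

# All possible regressions = every non-empty subset of the K candidate
# X-variables; there are 2^K - 1 such subsets (31 when K = 5).
variables = ["area", "riders", "high_tipper", "new_year", "pre_tip"]
subsets = [c for k in range(1, len(variables) + 1)
           for c in combinations(variables, k)]
# Each subset would be fitted as its own model, then compared by MSE
# or adjusted r^2 -- hence the computational cost for large K.
```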
BACKWARD ELIMINATION
Evaluate individual variable significance
Step 1: Build a model by using all potential 𝑋-variables
Step 2: Identify the least significant 𝑋-variable using t-test
Step 3: Remove this 𝑋-variable if its p-value is larger than the specified level of
significance; otherwise terminate the procedure
Step 4: Develop a new regression model after removing this 𝑋-variable, repeat steps
2 and 3 until all remaining 𝑋-variables are significant
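The four steps of backward elimination can be sketched as a loop. The p-values below are mock numbers standing in for what refitting the regression and running t-tests would produce at each step; in a real application `refit_pvalues` would re-estimate the model on the remaining variables:

```python
ALPHA = 0.05  # specified level of significance

def refit_pvalues(variables):
    # Hypothetical stand-in: a real implementation would refit the
    # regression on `variables` and return each slope's t-test p-value.
    mock = {"area": 0.01, "riders": 0.30, "high_tipper": 0.001,
            "new_year": 0.60, "pre_tip": 0.0001}
    return {v: mock[v] for v in variables}

def backward_elimination(variables, alpha=ALPHA):
    vars_left = list(variables)            # Step 1: start with all X-variables
    while vars_left:
        pvals = refit_pvalues(vars_left)
        worst = max(pvals, key=pvals.get)  # Step 2: least significant X-variable
        if pvals[worst] <= alpha:          # Step 3: all remaining significant?
            break                          #   then terminate the procedure
        vars_left.remove(worst)            # Step 4: drop it, refit, repeat
    return vars_left

kept = backward_elimination(["area", "riders", "high_tipper",
                             "new_year", "pre_tip"])
# With the mock p-values, "new_year" then "riders" are eliminated.
```

Forward selection and stepwise regression follow the same control-flow pattern with the add/remove logic reversed or combined.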
FORWARD SELECTION
Evaluate individual variable significance
Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant 𝑋-variable using t-test
Step 3: Add this 𝑋-variable if its p-value is smaller than the specified level of
significance; otherwise terminate the procedure
Step 4: Develop a new regression model after including this 𝑋-variable, repeat steps
2 and 3 until all significant 𝑋-variables are entered
STEPWISE REGRESSION
Evaluate individual variable significance
An 𝑋-variable that has entered can later leave; an 𝑋-variable that has been
eliminated can later go back in
Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant 𝑋-variable; add this 𝑋-variable if its p-value is
smaller than the specified level of significance, otherwise terminate the procedure
Step 3: Identify the least significant 𝑋-variable in the model; remove this 𝑋-variable
if its p-value is larger than the specified level of significance
Step 4: Repeat steps 2 and 3 until all significant 𝑋-variables are entered and none of
them has to be removed
PRINCIPLE OF MODEL BUILDING
A good model should
Have few independent variables
Have high predictive power
Have low correlation between independent variables
Be easy to interpret