LECTURE 3
REGRESSION ANALYSIS
- MULTIPLE REGRESSION
AGENDA
Last class:
Ŷi = 0.326 + 0.1578 Xi. For every $1 increase in taxi fare, what can we expect?
r² = 0.5533. What does it say about our model?
H0: β1 = 0. The p-value is very, very close to 0, which implies…
Basic Concepts of Multiple Linear Regression
Using Categorical (Dummy) Variables
Measures of Variation and Statistical Inference
FORMULATION OF MULTIPLE REGRESSION MODEL
A multiple regression model relates one dependent variable to two or more independent variables through a linear function:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + \varepsilon_i$

where Yi is the dependent variable, X1i, …, XKi are the independent variables, β0 is the population intercept, β1, …, βK are the population slope coefficients, and εi is the random error.
K is the number of independent variables (e.g., K = 1 for simple linear regression)
β0, β1, β2, …, βK are the K+1 parameters in a multiple regression model with K independent variables
b0, b1, b2, …, bK are used to represent the sample intercept and sample slope coefficients
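As a rough sketch of how the sample coefficients b0, …, bK can be estimated outside Excel, here is a least-squares fit with Python's statsmodels; the column names and data values are illustrative, loosely taken from the taxi example below:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "pretip_fare": [8.30, 15.30, 7.80, 52.80],   # X1
    "area_id":     [1, 1, 1, 0],                 # X2 (a 0/1 dummy, discussed later)
    "tips":        [1.65, 1.00, 1.25, 5.00],     # Y
})

X = sm.add_constant(df[["pretip_fare", "area_id"]])  # adds the intercept column
fit = sm.OLS(df["tips"], X).fit()                    # least-squares estimates

print(fit.params)     # b0 (const), b1, b2
print(fit.rsquared)   # r-squared
```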
MULTIPLE REGRESSION, 2 EXPLANATORY VARIABLES
Say we have n data points, or n observations
Our observations are in the form $(X_{11}, X_{21}, Y_1), (X_{12}, X_{22}, Y_2), \dots, (X_{1n}, X_{2n}, Y_n)$
Observation # | Taxi – Pre-tip fare (X1i) | RatecodeID, 1 = NYC, 2 = JFK (X2i) | Taxi – Tips (Yi) | (X1i, X2i, Yi)
#1            | 8.30                      | 1                                  | 1.65             | (8.30, 1, 1.65)
#2            | 15.30                     | 1                                  | 1.00             | (15.30, 1, 1.00)
#3            | 7.80                      | 1                                  | 1.25             | (7.80, 1, 1.25)
#27           | 52.80                     | 2                                  | 5.00             | (52.80, 2, 5.00)
Source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
TLC Trip Record Data: January 2019 Yellow Taxi Trip Records, published by the NYC Taxi & Limousine Commission
We will need to “fix” the RatecodeID column later…
FORMULATION OF MULTIPLE REGRESSION MODEL
Coefficients in a multiple regression net out the impact of each independent variable in the regression equation.
The estimated slope coefficient bj measures the change in the average value of Y as a result of a one-unit increase in Xj, holding all other independent variables constant (the “ceteris paribus” effect):
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_j X_j + \cdots + b_K X_K$
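A tiny numeric check of the ceteris paribus reading, with made-up coefficients (not the lecture's estimates): raising Xj by one unit while holding every other X fixed moves the prediction by exactly bj.

```python
import numpy as np

b = np.array([0.5, 0.15, -0.9])   # hypothetical b0, b1, b2
x = np.array([1.0, 10.0, 1.0])    # leading 1 multiplies the intercept b0
x_bumped = x.copy()
x_bumped[1] += 1.0                # one-unit increase in X1; X2 held constant

print(b @ x_bumped - b @ x)       # 0.15, i.e. exactly b1
```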
EXAMPLE – USING CATEGORICAL (DUMMY) VARIABLES
Last time, we did a simple linear regression on taxi fare and tips.
We want to see if the location also affects the tip.
Column E (RatecodeID) has 2 possibilities: 1= New York City, 2 = JFK Airport
Can we use column E as-is? Consider two trips from NYC and JFK, both with
fares of $10.
Observation #i | Taxi – Pre-tip fare (X1i) | RatecodeID, 1 = NYC, 2 = JFK (X2i) | What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i
e.g. 1         | 10.00                     | 1                                  | Ŷ1 = b0 + 10 b1 + b2
e.g. 2         | 10.00                     | 2                                  | Ŷ2 = b0 + 10 b1 + 2 b2

b2 vs 2 b2? Double the bonus?
USING CATEGORICAL (DUMMY) VARIABLES
Let’s define a new column: AreaID. We are “inside” the area if we are in NYC,
“outside” the area if we are NOT in NYC (i.e. JFK, etc).
We can pre-process the data so that 𝑋2𝑖 = 1 if we are inside NYC and 𝑋2𝑖 = 0
if we are outside NYC
Observation #i | Taxi – Pre-tip fare (X1i) | AreaID, 1 = NYC, 0 = JFK (X2i) | What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i
e.g. 1         | 10.00                     | 1                              | Ŷ1 = b0 + 10 b1 + b2
e.g. 2         | 10.00                     | 0                              | Ŷ2 = b0 + 10 b1
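The pre-processing step could look like this in pandas; a minimal sketch where the raw column name RatecodeID follows the TLC data and the new name AreaID is our own choice:

```python
import pandas as pd

df = pd.DataFrame({"RatecodeID": [1, 1, 1, 2]})   # 1 = NYC, 2 = JFK

# AreaID = 1 if the trip starts inside NYC, 0 if outside (JFK, etc.)
df["AreaID"] = (df["RatecodeID"] == 1).astype(int)
print(df)
```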
USING CATEGORICAL (DUMMY) VARIABLES
𝑋2𝑖 = 1 if we are inside NYC and 𝑋2𝑖 = 0 if we are outside NYC
Interpretation:
If 𝑏2 > 0: Everything else remaining constant, we expect to receive a bonus tip of
$|𝑏2 | when we pick up a passenger in NYC
If 𝑏2 < 0: Everything else remaining constant, we expect our tip to reduce by $|𝑏2 |
when we pick up a passenger in NYC.
This variable incorporates a fixed tip amount for NYC vs non-NYC trips, NOT a change in the tip percentage!
USING CATEGORICAL (DUMMY) VARIABLES
Useful when an explanatory variable isn’t numerical (e.g. colours, locations)
Use 0, 1 variables: 0 = “is not, does not fit definition”, 1 = “is, fits definition”
If a category has 𝑐 choices, then we need 𝑐 − 1 categorical variables
E.g. product design: a product can be red, yellow, or blue. We want to see how colour affects popularity. In a regression model, we need 2 categorical variables:
X1 = 1 if it is red, and 0 otherwise
X2 = 1 if it is yellow, and 0 otherwise
(Blue is the baseline case captured by the intercept, as in e.g. 3 below.)
Obs #i          | Red? (X1i) | Yellow? (X2i) | What the model looks like: Ŷi = b0 + b1 X1i + b2 X2i + ⋯
e.g. 1 (Red)    | 1          | 0             | Ŷ1 = b0 + b1 + ⋯
e.g. 2 (Yellow) | 0          | 1             | Ŷ2 = b0 + b2 + ⋯
e.g. 3 (Blue)   | 0          | 0             | Ŷ3 = b0 + ⋯
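In pandas, the c − 1 dummies can be generated automatically; a sketch with the colour example, where drop_first=True drops the alphabetically first category ("blue"), which becomes the baseline absorbed into b0:

```python
import pandas as pd

products = pd.DataFrame({"colour": ["red", "yellow", "blue"]})

# 3 choices -> keep 3 - 1 = 2 dummy columns
dummies = pd.get_dummies(products["colour"], drop_first=True, dtype=int)
print(dummies)   # columns "red" and "yellow"; blue rows are (0, 0)
```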
BUILDING THE MODEL
After fixing the categorical variable for AreaID, we can fill in the regression
window.
MODEL OUTPUT
Excel’s output:
$\hat{Y} = 1.3771 + 0.1488 X_1 - 0.9521 X_2$
*Scientific notation: 1.7284E−226 = 1.7284 × 10⁻²²⁶ ≈ 0
INTERPRETATION OF ESTIMATES
The estimated multiple regression equation:
$\hat{Y} = 1.3771 + 0.1488 X_1 - 0.9521 X_2$
Ŷ = estimated taxi tips in NYC, in $
X1 = pre-tip amount in $
X2 = area indicator (NYC = 1, non-NYC (JFK) = 0)
Interpretation of the estimated slope coefficients:
b1 = 0.1488 says that the estimated average tips increase by $0.1488 for each $1 increase in pre-tip taxi fare, given that the other independent variables remain constant
b2 = −0.9521 says that the estimated average tips decrease by $0.9521 when the trip starts in NYC instead of JFK, given that the other independent variables remain constant
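To make the fixed-amount reading concrete, compare two $10 trips: Ŷ_NYC = 1.3771 + 0.1488(10) − 0.9521(1) = 1.9130, while Ŷ_JFK = 1.3771 + 0.1488(10) − 0.9521(0) = 2.8651. The gap is exactly b2 = −0.9521 regardless of the fare: a fixed dollar amount, not a percentage.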
COMPARISON OF MODELS
Suppose we add more explanatory variables:
X1 = pre-tip amount in $
X2 = area indicator (NYC = 1, non-NYC (JFK) = 0)
X3 = # of riders
X4 = New Year's Day indicator (Jan 1 = 1, otherwise = 0)

$\hat{Y} = 1.3181 + 0.1485 X_1 - 0.9501 X_2 + 0.0404 X_3 + 0.0503 X_4$
INTERPRETATION OF ESTIMATES
Multiple regression model:
$\hat{Y} = 1.3181 + 0.1485 X_1 - 0.9501 X_2 + 0.0404 X_3 + 0.0503 X_4$
The estimated slope coefficients:
b1 = 0.1485 says that the estimated average tips increase by $0.1485 for each $1 increase in pre-tip taxi fare, holding all other things equal
b2 = −0.9501 says that the estimated average tips decrease by $0.9501 when the trip starts in NYC instead of non-NYC (JFK), holding all other things equal
b3 = 0.0404 says that the estimated average tips increase by $0.0404 for each additional rider, holding all other things equal
b4 = 0.0503 says that the estimated average tips increase by $0.0503 if the trip is on New Year's Day, holding all other things equal
EVALUATE THE MODEL
r² and adjusted r²
F-test for overall model significance
t-test for the significance of a particular X-variable
MEASURES OF VARIATION - r²
$\hat{Y} = 1.3181 + 0.1485 X_1 - 0.9501 X_2 + 0.0404 X_3 + 0.0503 X_4$
Total variation of the Y-variable is made up of two parts:
$SST = SSR + SSE$
where
$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ (total sum of squares)
$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ (regression sum of squares; the part explained by the X-variables: pre-tip fare, area, # of passengers, New Year's Day)
$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ (error sum of squares)
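A small sketch of the decomposition with toy data (not the full taxi sample); for a least-squares fit with an intercept, the identity SST = SSR + SSE holds exactly:

```python
import numpy as np

x = np.array([8.30, 15.30, 7.80, 52.80])
y = np.array([1.65, 1.00, 1.25, 5.00])

X = np.column_stack([np.ones_like(x), x])    # intercept column + X1
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares estimates
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)            # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)        # explained by the regression
sse = np.sum((y - y_hat) ** 2)               # error (residual) variation

print(np.isclose(sst, ssr + sse))            # True
print("r2 =", ssr / sst)
```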
MEASURES OF VARIATION - r²
We can ALWAYS increase r² by adding variables, even ones that don't explain the changes in Y
This is easier to see with less data: see the “r-squared comparison” tab in the spreadsheet
We add one more column of 0/1s (1 = odd-numbered row, 0 = even-numbered row) and compare the two regression outputs
MEASURES OF VARIATION - r²
What is the net effect of adding a new X-variable?
r² increases, even if the new X-variable explains an insignificant proportion of the variation of the Y-variable
Is it fair to use r² for comparing models with different numbers of X-variables?
A degree of freedom* is lost, as a slope coefficient has to be estimated for the new X-variable
Did the new X-variable add enough explanatory power to offset the loss of one degree of freedom?
Degrees of freedom of the residual = n − (K + 1) = n − 1 − K
*Degrees of freedom: the number of independent pieces of information (data values) in the random sample.
If K + 1 parameters (the coefficients b0, b1, …, bK) must be estimated before the sum of squared errors, SSE, can be calculated from a sample of size n, the degrees of freedom equal n − (K + 1).
MEASURES OF VARIATION – ADJUSTED r²
(Recall: $r^2 = 1 - \frac{SSE}{SST}$)

Adjusted $r^2 = 1 - \frac{SSE/(n-K-1)}{SST/(n-1)} = 1 - \frac{n-1}{n-K-1}(1 - r^2)$
Measures the proportion of variation of the Y values that is explained by the regression equation with the independent variables X1, X2, …, XK, after adjusting for the sample size (n) and the number of X-variables used (K)
Smaller than or equal to r², and can be negative
Penalizes the excessive use of X-variables
Useful for comparing models with different numbers of X-variables
EXAMPLE – ADJUSTED r²
Compare the models that we've built.
Number of observations: 197,103. SST: 1,163,798.

                              | 1 explanatory variable (pre-tip fare) | 2 explanatory variables (pre-tip fare, area ID) | 4 variables
Degrees of freedom (residual) | 197,101                               | 197,100                                         | 197,098
SSE                           | 519,852                               | 517,136                                         | 516,911
r²                            | 0.553314                              | 0.555647                                        | 0.555841
Adjusted r²                   | 0.553312                              | 0.555643                                        | 0.555832
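The table can be reproduced from n, SST, and SSE alone; a quick check in Python (the values match up to rounding):

```python
n, sst = 197_103, 1_163_798

for k, sse in [(1, 519_852), (2, 517_136), (4, 516_911)]:
    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    print(f"K={k}: r2={r2:.6f}, adjusted r2={adj_r2:.6f}")
```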
INFERENCE: OVERALL MODEL SIGNIFICANCE
Is the model significant? Do we need a model?
F-test
OVERALL MODEL SIGNIFICANCE: F-TEST
F-test for the overall model significance
Null hypothesis H0: β1 = β2 = ⋯ = βK = 0 (none of the X-variables affects Y)
Alternative hypothesis H1: at least one βi ≠ 0 (at least one X-variable affects Y)
We want to REJECT the null hypothesis by showing that the probability of seeing our values of b1, b2, …, bK is “low” if H0 were indeed true.
F-statistic:
$F = \frac{MSR}{MSE} = \frac{SSR/K}{SSE/(n-K-1)}$ with K, (n − 1 − K) degrees of freedom (d.f.), where MSR is the mean square for SSR and MSE is the mean square for SSE
OVERALL MODEL SIGNIFICANCE: F-TEST
$F = \frac{MSR}{MSE} = \frac{SSR/K}{SSE/(n-K-1)}$ with K, (n − 1 − K) degrees of freedom (d.f.)
First decide on the size of the rejection region α (one tail), the level of significance
Method 1 (with F-table): rejection region approach
Reject H0 if F > critical value (C.V.) = F_{α,K,(n−K−1)}
Method 2 (with Excel output): p-value approach
p-value = P(F_{K,(n−K−1)} ≥ F)
Reject H0 if p-value < α
OVERALL MODEL SIGNIFICANCE: F-TEST
[Figure: probability distribution of F with the upper-tail rejection region shaded] Suppose α = 0.05. The critical value is C.V. = F_{α,K,(n−K−1)} = 2.37, while the F-statistic calculated from the sample data is F = 61,664. The p-value = P(F ≥ 61,664) ≈ 0 < 5%, so at the 5% significance level H0 is rejected.
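A sketch of the same F-test in Python with scipy, plugging in the n, K, SST, and SSE figures from the comparison table; the slide's F = 61,664 and C.V. = 2.37 reappear up to rounding:

```python
from scipy import stats

n, K = 197_103, 4
sst, sse = 1_163_798, 516_911
ssr = sst - sse

F = (ssr / K) / (sse / (n - K - 1))      # about 61,664
cv = stats.f.ppf(0.95, K, n - K - 1)     # about 2.37 for alpha = 0.05
p_value = stats.f.sf(F, K, n - K - 1)    # upper-tail area, effectively 0

print(F, cv, p_value)                    # p-value < alpha -> reject H0
```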
SIGNIFICANCE OF A PARTICULAR X-VARIABLE: T-TEST
Even if we reject H0 in our F-test, we cannot distinguish which X-variable(s) have a significant impact on the Y-variable
t-test for a particular X-variable's significance:
Null H0: βi = 0 (Xi has no linear relationship with Y, given the presence of the other X-variable(s))
Alternative H1: βi ≠ 0 (Xi is linearly related to Y, given the presence of the other X-variable(s))
SIGNIFICANCE OF A PARTICULAR X-VARIABLE: T-TEST
Null H0: β1 = 0
Method 1: rejection region approach
Reject H0 if |T| > C.V. = t_{α/2,(n−K−1)}
Method 2: p-value approach
p-value = P(|T| ≥ |t|)
Reject H0 if p-value < α
[Figure: Student’s t-distribution centred at β1 = 0, with rejection area α/2 in each tail] If α = 5%, the critical values are ±t_{0.025,(n−5)} ≈ ±1.96. The t-statistic calculated from the sample (the coefficient estimate divided by its standard error) is t = 348.81, far beyond the critical value.
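The same two-tailed t-test by hand in Python, taking the slide's t = 348.81 for the pre-tip fare coefficient as given:

```python
from scipy import stats

n, K = 197_103, 4
dof = n - K - 1

t = 348.81                               # coefficient estimate / standard error
cv = stats.t.ppf(0.975, dof)             # about 1.96 for alpha = 5%
p_value = 2 * stats.t.sf(abs(t), dof)    # both tails

print(cv, p_value)                       # |t| > C.V., p-value ~ 0 -> reject H0
```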
EXAMPLE
Conclusion: the p-value is smaller than 5%, so reject H0. The pre-tip fare is significantly related to the tips, given the presence of the other X-variables.
What about the other variables?
According to the t-test results, the p-value for each of the four explanatory variables is smaller than 5%.
This indicates that each explanatory variable is significantly related to tips paid in NYC, given the presence of the other X-variables.
*Scientific notation: 6.41657E−08 = 6.41657 × 10⁻⁸ = 0.0000000642 ≈ 0
EXAMPLE
What does the table look like if there is an insignificant explanatory variable?
We added a fifth variable to label rows as “odd” or “even” (see the “5var – odd/even” tab).
The p-value for “Odd/Even transaction” is LARGER than 5%, so we cannot reject H0. This indicates that the odd/even indicator is not significantly related to tips paid in NYC, given the presence of the other X-variables.
VARIABLES SELECTION STRATEGIES
Some of the independent variables may be insignificant based on t-test results
We may consider eliminating insignificant independent variables using the following methods:
All possible regressions
Backward elimination
Forward selection
Stepwise regression
ALL POSSIBLE REGRESSIONS
Develop all possible regression models between the dependent variable and all possible combinations of independent variables
If there are K X-variables to consider, there are 2^K − 1 possible regression models to develop (see the sketch below)
The criteria for selecting the best model may include:
Mean square error (MSE)
Adjusted r²
Disadvantages of all possible regressions:
No unique conclusion: different criteria lead to different conclusions
Looks at overall model performance, but not individual variable significance
When there is a large number of potential X-variables, the computational time can be long
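A brute-force sketch of the search, assuming statsmodels and a DataFrame df; the function and column arguments are our own, not a library routine. It fits all 2^K − 1 subsets and ranks them by adjusted r², one of the criteria above:

```python
from itertools import combinations
import statsmodels.api as sm

def all_possible_regressions(df, y_col, x_cols):
    """Fit every non-empty subset of x_cols; best adjusted r-squared first."""
    fits = []
    for r in range(1, len(x_cols) + 1):
        for subset in combinations(x_cols, r):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[y_col], X).fit()
            fits.append((fit.rsquared_adj, subset))
    return sorted(fits, reverse=True)
```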
BACKWARD ELIMINATION
Evaluate individual variable significance (a sketch follows below)
Step 1: Build a model using all potential X-variables
Step 2: Identify the least significant X-variable using the t-test
Step 3: Remove this X-variable if its p-value is larger than the specified level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model after removing this X-variable; repeat steps 2 and 3 until all remaining X-variables are significant
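A minimal sketch of backward elimination with statsmodels; names are illustrative:

```python
import statsmodels.api as sm

def backward_elimination(df, y_col, x_cols, alpha=0.05):
    """Drop the least significant X-variable until all p-values < alpha."""
    kept = list(x_cols)
    while kept:
        fit = sm.OLS(df[y_col], sm.add_constant(df[kept])).fit()
        p_values = fit.pvalues.drop("const")   # t-test p-value per X-variable
        worst = p_values.idxmax()              # least significant X-variable
        if p_values[worst] <= alpha:           # all significant: terminate
            return fit
        kept.remove(worst)                     # remove it, then refit
    return None                                # no X-variable survived
```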
FORWARD SELECTION
Evaluate individual variable significance (a sketch follows below)
Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant candidate X-variable using the t-test
Step 3: Add this X-variable if its p-value is smaller than the specified level of significance; otherwise terminate the procedure
Step 4: Develop a new regression model after including this X-variable; repeat steps 2 and 3 until all significant X-variables are entered
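A matching sketch of forward selection, again with statsmodels and illustrative names:

```python
import statsmodels.api as sm

def forward_selection(df, y_col, x_cols, alpha=0.05):
    """Add the most significant candidate X-variable until none qualifies."""
    kept, remaining = [], list(x_cols)
    while remaining:
        # p-value each candidate would earn if added to the current model
        trials = {
            x: sm.OLS(df[y_col], sm.add_constant(df[kept + [x]])).fit().pvalues[x]
            for x in remaining
        }
        best = min(trials, key=trials.get)
        if trials[best] >= alpha:   # no significant candidate: terminate
            break
        kept.append(best)
        remaining.remove(best)
    return kept
```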
STEPWISE REGRESSION
Evaluate individual variable significance
An X-variable that has entered can later leave; an X-variable that was eliminated can later re-enter
Step 1: Start with a model which contains only the intercept term
Step 2: Identify the most significant candidate X-variable; add it if its p-value is smaller than the specified level of significance; otherwise terminate the procedure
Step 3: Identify the least significant X-variable in the model; remove it if its p-value is larger than the specified level of significance
Step 4: Repeat steps 2 and 3 until all significant X-variables are entered and none of them has to be removed
PRINCIPLE OF MODEL BUILDING
A good model should
Have few independent variables
Have high predictive power
Have low correlation between independent variables
Be easy to interpret