IIMT 2641 Introduction to Business Analytics
Module 3: Linear Regression
Topic 1: Simple Linear Regression
Bordeaux wine
§ Large differences in price and quality between years, although wine is
produced in a similar way
§ Meant to be aged, so hard to tell if wine will be good when it is on the
market
§ Expert tasters predict which ones will be good
§ Can analytics be used to come up with a different system for judging wine?
Predicting the quality of wine
§ March 1990 - Orley Ashenfelter, a Princeton economics professor, claims
he can predict wine quality without tasting the wine
Building a model
§ Ashenfelter used a method called linear regression
– Predicts an outcome variable, or dependent variable
– Predicts using a set of independent variables
Building a model
§ Dependent variable:
– Typical price in 1990-1991 wine auctions (approximates quality)
– Conduct a logarithmic transformation
q This gives a better linear fit
§ Independent variables:
– Age of wine (in 1990)
q Older wines are more expensive
– Weather
q Average Growing Season Temperature (AGST)
q Harvest Rain
q Winter Rain
– Population of France
The wine data (1952 - 1978)
Quick Question: What is the relationship between harvest rain, average
growing season temperature, and wine prices?
Baseline model (?)
Baseline model (Take the mean)
[Figure: the baseline model predicts the mean of y, ȳ, for every value of x]
One-Variable Linear Regression
Simple Regression Model
The population model of y with one predictor variable x is:

y = β0 + β1·x + ε

§ y is the dependent variable (DV)
§ x is the independent variable (IV)
§ Regression function: E[Y|x] = β0 + β1·x is the mean of Y given x
§ β0 is the y-intercept (the value of E[Y|x] when x = 0)
§ β1 is the slope for x, which is the change in E[Y|x] for a unit increase in x
§ Random errors ε (derivations not required):
– The random errors are a random sample from N(0, σ²), i.e. i.i.d. (independent and identically distributed) random variables
– Each observation has its own random error
– The regression output does not show the errors, but it does estimate their standard deviation, se
§ The random errors ε and the IV (X) are uncorrelated
§ These assumptions are important for effective business analytics
Estimated Regression Function
§ Estimate the regression model with n observations (xi, yi) for i = 1, …, n
§ The estimated or predicted value of y given x is:

ŷ = b0 + b1·x

§ b0 is the sample estimate of the population intercept β0
§ b1 is the sample estimate of the population slope β1
§ b0 and b1 are sample statistics (similar to x̄) and therefore have sampling distributions
One-Variable Linear Regression
[Figure: scatter plot of the data with the fitted line ŷ = 1 + 2x]
Data and Predicted Values
§ What is the observed y when x = 1?
§ What is the predicted y when x = 1?
§ What is the observed y when x = 4?
§ What is the predicted y when x = 4?
Data and Predicted Values
§ What is the observed y when x = 1?
– y = 6
§ What is the predicted y when x = 1?
– ŷ = 1 + (2)(1) = 3
§ What is the observed y when x = 4?
– y = 4
§ What is the predicted y when x = 4?
– ŷ = 1 + (2)(4) = 9
Estimated Model and Residuals
§ Residuals are the difference between the observed values of y and the predicted values ŷ:
– r = y − ŷ
– Each observation has one observed y, one predicted ŷ, and one residual r.
§ The residuals are the errors between the observed and predicted values.
[Figure: four observations y1, …, y4 with their predicted values ŷ1, …, ŷ4 and residuals ri = yi − ŷi]
Computing Residuals
[Figure: the fitted line with residuals r1, r2, r3, r4 marked]
§ What is the residual r2 at x = 2?
§ What is the residual r3 at x = 3?
Computing Residuals
§ What is the residual r2 at x = 2?
– r2 = y2 − ŷ2 = 3 − (1 + 2·2) = 3 − 5 = −2
§ What is the residual r3 at x = 3?
– r3 = y3 − ŷ3 = 11 − (1 + 2·3) = 11 − 7 = 4
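The residual calculations above are easy to check in a few lines of code. A minimal sketch in Python (the slides themselves use spreadsheet output, so the language choice here is an assumption); the observed y values are the ones read off the slide's figure:

```python
# Residuals for the slide's estimated line y_hat = 1 + 2x,
# using the observed points from the figure.
def y_hat(x):
    return 1 + 2 * x

observed = {1: 6, 2: 3, 3: 11, 4: 4}  # x -> observed y
residuals = {x: y - y_hat(x) for x, y in observed.items()}
print(residuals)  # {1: 3, 2: -2, 3: 4, 4: -5}
```

Note that r2 = −2 and r3 = 4 match the hand calculations on the slide.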
Ordinary Least Squares (OLS) Criterion
The least squares line finds the estimates b0 and b1 of the coefficients that minimize the sum of squared errors for a sample {(xi, yi)} with n observations:

SSE = Σ_{i=1}^n (yi − ŷi)², where ŷi = b0 + b1·xi for i = 1, …, n

Why squared? Because the sum of the raw (unsquared) residuals could be zero even for a badly fitting line.

Writing SSE(b0, b1) = Σ_{i=1}^n (yi − b0 − b1·xi)² and setting

∂SSE(b0, b1)/∂b0 = 0 and ∂SSE(b0, b1)/∂b1 = 0

gives:

b1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²
b0 = ȳ − b1·x̄

x̄: sample average of the independent variable
ȳ: sample average of the dependent variable
(You do not need to memorize the derivation.)
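The closed-form OLS formulas can be applied directly without any library. A minimal Python sketch on a small made-up dataset (the x and y values are illustrative, not the wine data):

```python
# OLS estimates for one predictor:
#   b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
x = [1, 2, 3, 4, 5]             # hypothetical IV values
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical DV values

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Sum of squared errors at the minimizing (b0, b1)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
print(b0, b1, sse)
```

Any other (b0, b1) pair would give a larger SSE on this sample; that is exactly the least squares criterion.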
Estimate a Linear Model (One Variable)
[Regression output: estimated standard errors for the estimated intercept and slope coefficients, t-scores, p-values, and R-squared]
§ H0: AGST coefficient = 0 versus HA: AGST coefficient ≠ 0
§ t-score = (Estimated Coefficient − 0)/(Standard Error)
§ Two-tail test: p-value = 2·P(T < −|t-score|)
§ Coefficient of Determination: R-squared
One-Variable Linear Regression
ŷ = −3.4178 + 0.6351·AGST
Estimate a Linear Model (One Variable)
• Estimated model for price:
ŷ = −3.4178 + 0.6351·AGST
• The predicted LogPrice increases by 0.6351 for every 1-degree increase in average growing season temperature.
• If AGST = 15, then ŷ = ?
• If AGST = 18, then ŷ = ?
• If AGST = 20, then ŷ = ?
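The three predictions asked for above follow by plugging each AGST value into the estimated model. A quick Python check (the slide expects a hand calculation; the function name here is our own):

```python
# Estimated model from the slide: y_hat = -3.4178 + 0.6351 * AGST
def predict_log_price(agst):
    return -3.4178 + 0.6351 * agst

for agst in (15, 18, 20):
    print(agst, round(predict_log_price(agst), 4))
# 15 -> 6.1087, 18 -> 8.014, 20 -> 9.2842
```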
T-Tests for the Coefficients: H0: βj = 0 versus HA: βj ≠ 0
Two-Tail Test for the Slope
(Very important: can you predict Y from X?)
H0: β1 = 0 versus HA: β1 ≠ 0
• t-score = (coefficient − 0)/(std. error)
• t-score = (0.6351 − 0)/0.1509 = 4.208
• p-value = 2·P(T < −|4.208|) = 2·T.DIST(−4.208, 23, TRUE) < 0.001
• df = n − 1 − #IV = 25 − 1 − 1 = 23, which matches the error df shown under "Sum of Squares" in the regression output
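The p-value on this slide can be reproduced without a spreadsheet. The sketch below (our own construction, not the course's method) computes the t-score and then integrates the Student's t density numerically as a stand-in for Excel's T.DIST:

```python
import math

# Slope t-test from the slide: t = (b1 - 0) / SE(b1), with df = n - 1 - #IV = 23.
b1, se, df = 0.6351, 0.1509, 23
t_score = (b1 - 0) / se  # about 4.21

def t_pdf(t, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)

# Two-tailed p-value = 2 * P(T > |t|); the tail probability is approximated
# with a simple Riemann sum over the density.
step, upper = 1e-3, 60.0
tail = sum(t_pdf(abs(t_score) + i * step, df) * step
           for i in range(int((upper - abs(t_score)) / step)))
p_value = 2 * tail
print(round(t_score, 3), p_value < 0.001)
```

The result confirms the slide: the two-tailed p-value is below 0.001, so the AGST slope is statistically significant.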
How Well the Model Fits the Data
§ The simplest commonly used measure of fit is R² (the coefficient of determination): R² = 1 − SSE/SST
– SSE = Σ_{i=1}^n (yi − ŷi)²: sum of squared errors
q The variation of Y that cannot be explained by the regression
– SST = Σ_{i=1}^n (yi − ȳ)²: total sum of squares
q The total amount of variation of Y around its mean
q The "error" generated by a baseline model without any inputs
– Decomposition of the variation of Y:
Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n (yi − ŷi)² + Σ_{i=1}^n (ŷi − ȳ)²
(Total variation = Unexplained variation + Explained variation)
§ R² is the proportion of the variance in the DV explained by the regression model.
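The decomposition SST = SSE + SSR (which holds exactly for an OLS fit with an intercept) is easy to verify numerically. A Python sketch with made-up data, not the wine dataset:

```python
# Verify SST = SSE + SSR and compute R^2 for an OLS fit (illustrative data).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
r2 = 1 - sse / sst
print(round(r2, 4))  # close to 1: x explains almost all variation in y
```

Equivalently, r2 equals ssr / sst, since the total variation splits exactly into the explained and unexplained pieces.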
Coefficient of Determination: R-Squared
• R-squared is a measure of fit
• A bigger R-squared indicates a better fit, all else being equal
• 43.5% of the variation in prices is explained by the simple regression on AGST
• 0 ≤ R-squared ≤ 1
Use Each Variable on Its Own
§ R² = 0.44 using average growing season temperature (variable significant at 0.001)
§ R² = 0.32 using harvest rain (variable significant at 0.01)
§ R² = 0.22 using France population (variable significant at 0.05)
§ R² = 0.20 using age (variable significant at 0.05)
§ R² = 0.02 using winter rain (not significant)
§ Multiple linear regression allows us to use more than one variable to potentially improve our predictive ability.