Multiple Linear Regression
Sasadhar Bera, IIM Ranchi
Multiple Linear Regression Model
Multiple linear regression involves one dependent
variable and more than one independent variable. The
equation that describes the multiple linear regression
model is given below:

y = β0 + β1 x1 + β2 x2 + . . . + βk xk + ε

y is the dependent variable and x1, x2, . . ., xk are the
independent variables. These independent variables are
used to predict the dependent variable.

β0, β1, β2, . . ., βk are in total (k+1) unknown regression
coefficients (also called model parameters). These
regression coefficients are estimated from observed
sample data.

The term ε (pronounced "epsilon") is the random error.
Data for Multiple Regression
Suppose that n observations are collected on the response
variable (y) and on the k independent variables present in
the regression model. The data layout, with i = 1, 2, . . ., n
indexing observations and j = 1, 2, . . ., k indexing variables:

i     y     x1     x2     . . .   xj     . . .   xk
1     y1    x11    x12    . . .   x1j    . . .   x1k
2     y2    x21    x22    . . .   x2j    . . .   x2k
.     .     .      .              .              .
i     yi    xi1    xi2    . . .   xij    . . .   xik
.     .     .      .              .              .
n     yn    xn1    xn2    . . .   xnj    . . .   xnk
Scalar Notation: Multiple Linear Regression
The scalar notation of the regression model:

yi = β0 + β1 xi1 + β2 xi2 + . . . + βj xij + . . . + βk xik + εi,
i = 1, 2, . . ., n;  j = 1, 2, . . ., k

n = total number of observations
k = number of independent variables
βj's are the model parameters.
Matrix Notation: Multiple Linear Regression
In matrix notation, the regression model is

y(n×1) = X(n×(k+1)) β((k+1)×1) + ε(n×1)

where n is the total number of observations, k is the number
of independent variables, and β is the vector of model
parameters:

y = [y1, . . ., yi, . . ., yn]ᵀ

X = | 1  x11  . . .  x1j  . . .  x1k |
    | .  .          .           .   |
    | 1  xi1  . . .  xij  . . .  xik |
    | .  .          .           .   |
    | 1  xn1  . . .  xnj  . . .  xnk |

β = [β0, β1, . . ., βj, . . ., βk]ᵀ

ε = [ε1, . . ., εi, . . ., εn]ᵀ
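As a concrete illustration, the response vector and design matrix can be built with NumPy. The data values here are hypothetical, not from the slides:

```python
import numpy as np

# Hypothetical sample: n = 5 observations, k = 2 regressors.
y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])      # response vector
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Design matrix X: a leading column of ones (for the intercept
# beta_0) followed by one column per regressor -> shape n x (k+1).
X = np.column_stack([np.ones(len(y)), x1, x2])

print(X.shape)   # (5, 3), i.e. n x (k + 1)
```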
Model Parameter Estimation
The error in a regression model is the difference between
the actual and predicted value. It may be positive or
negative.

The error is also known as the residual. The value predicted
by the regression equation is called the fitted value, or fit.

The sum of squared differences between the actual and
predicted values is known as the sum of squared errors. The
least squares method minimizes the sum of squared errors
to find the best-fitting plane.

Note that the regressor variables in a linear regression
model are non-random; that is, their values are fixed.
Model Parameter Estimation (Contd.)
In matrix notation, the regression equation is

y = Xβ + ε

The least squares estimator is the β̂ that minimizes

L = Σ(i = 1 to n) εi² = (y - Xβ)ᵀ (y - Xβ)

The least squares estimator must satisfy

∂L/∂β = -2 Xᵀ y + 2 Xᵀ X β̂ = 0

β̂ = (Xᵀ X)⁻¹ Xᵀ y,  the estimated model parameters.

The fitted regression line: ŷ = X β̂
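A minimal NumPy sketch of the least squares estimate, using the same hypothetical data as earlier. Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly; `np.linalg.lstsq` serves as a cross-check:

```python
import numpy as np

# Hypothetical data: n = 5 observations, k = 2 regressors.
y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],   # x1
    [2.0, 1.0, 4.0, 3.0, 5.0],   # x2
])

# Normal equations: (X^T X) beta_hat = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values: y_hat = X beta_hat.
y_hat = X @ beta_hat

# Cross-check against NumPy's built-in least squares solver.
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_ref))   # True
```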
Estimated Residual and Standard Error
For the ith observation (xi), the predicted value or fit is

ŷi = xiᵀ β̂

The error in the fit is called the residual:

ei = yi - ŷi

Mean square error = MSE = [ Σ(i = 1 to n) ei² ] / (n - k - 1)

where n is the total number of observations and k is the
number of regressors.

Standard error (SE) of estimate = σ̂ = √MSE

Var(β̂) = σ̂² (Xᵀ X)⁻¹
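Continuing the hypothetical example, the MSE, the standard error of estimate, and the coefficient covariance matrix follow directly from the formulas above:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
])
n, p = X.shape              # p = k + 1
k = p - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                  # residuals e_i = y_i - yhat_i

mse = (e @ e) / (n - k - 1)           # MSE = sum(e_i^2) / (n - k - 1)
se = np.sqrt(mse)                     # standard error of estimate

# Var(beta_hat) = MSE * (X^T X)^(-1); its diagonal holds the
# variance of each estimated coefficient.
cov_beta = mse * np.linalg.inv(X.T @ X)
print(np.sqrt(np.diag(cov_beta)))     # coefficient standard errors
```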
Testing Significance of Regression Model
The test for significance of regression determines whether
there is a linear relationship between the response variable
and the regressor variables.

H0: β1 = β2 = . . . = βk = 0
H1: at least one βj is not zero

The test procedure involves an analysis of variance
(ANOVA) partitioning of the total sum of squares into a sum
of squares due to regression and a sum of squares due to
error (or residual).

Total number of model parameters = p = number of
regression coefficients = k + 1
Testing Significance of Regression Model (Contd.)
ANOVA table:

Source of Variation | DF        | SS  | MS                  | Fcal
Regression          | k         | SSR | MSR = SSR / k       | MSR/MSE
Residual error      | n - k - 1 | SSE | MSE = SSE / (n-k-1) |
Total               | n - 1     | TSS |                     |

The sums of squares:

TSS = Σ(i = 1 to n) (yi - ȳ)²

SSR = Σ(i = 1 to n) (ŷi - ȳ)² = β̂ᵀ Xᵀ y - [ Σ(i = 1 to n) yi ]² / n

SSE = Σ(i = 1 to n) (yi - ŷi)² = yᵀ y - β̂ᵀ Xᵀ y

TSS = SSR + SSE
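The ANOVA quantities can be sketched numerically on the hypothetical data; the identity TSS = SSR + SSE serves as a built-in check:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
])
n, p = X.shape
k = p - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # sum of squares, regression
sse = np.sum((y - y_hat) ** 2)          # sum of squares, error

msr = ssr / k
mse = sse / (n - k - 1)
f_cal = msr / mse                       # compare with F(k, n-k-1)

print(np.isclose(tss, ssr + sse))       # True: TSS = SSR + SSE
```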
Significance Test of Individual Regression
Coefficient
Adding an unimportant variable to the model can actually
increase the mean square error, thereby decreasing the
usefulness of the model.

The hypothesis for testing the significance of any
individual regression coefficient, say βj, is

H0: βj = 0
H1: βj ≠ 0

Test statistic: Tcal = β̂j / √(σ̂² Cjj)

where σ̂² is the mean square error (MSE) and Cjj is the jth
diagonal element of (Xᵀ X)⁻¹. Reject H0 if |Tcal| > tα/2, (n-k-1).
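A sketch of the per-coefficient t-statistics on the hypothetical data; the critical value tα/2, (n-k-1) would come from a t-table or `scipy.stats.t.ppf` and is not computed here:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
])
n, p = X.shape
k = p - 1

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
mse = (e @ e) / (n - k - 1)

# T_cal,j = beta_hat_j / sqrt(MSE * C_jj), where C_jj is the j-th
# diagonal element of (X^T X)^(-1).
c_jj = np.diag(XtX_inv)
t_cal = beta_hat / np.sqrt(mse * c_jj)
print(t_cal)    # one statistic per coefficient (incl. intercept)
```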
Confidence Interval of Mean Response
In matrix notation, the regression equation is

y = Xβ + ε,  where ε ~ Normal(0, σ²)

Consider a point x0 = [1, x01, x02, . . ., x0j, . . ., x0k]ᵀ.
Since E(ε) = 0, the mean response is E(y) = E(Xβ) + E(ε) = Xβ, so

ŷ|x0 = E(y | x0) = x0ᵀ β̂

var(ŷ | x0) = σ̂² x0ᵀ (Xᵀ X)⁻¹ x0

The 100(1-α)% confidence interval of the mean response at
point x0:

ŷ|x0 ± tα/2, (n-p) √[ σ̂² x0ᵀ (Xᵀ X)⁻¹ x0 ]
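A sketch of the interval computation at a hypothetical point x0. The critical value t0.025, 2 ≈ 4.303 is taken from a t-table, assuming α = 0.05 and n - p = 2 degrees of freedom for this small example:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
mse = (e @ e) / (n - p)

x0 = np.array([1.0, 2.5, 3.0])     # [1, x01, x02], hypothetical point
y0_hat = x0 @ beta_hat             # estimated mean response at x0
se_mean = np.sqrt(mse * (x0 @ XtX_inv @ x0))

t_crit = 4.303                     # t_{0.025, n-p} from a t-table
lower = y0_hat - t_crit * se_mean
upper = y0_hat + t_crit * se_mean
print(lower, upper)
```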
Coefficient of Multiple Determination
Coefficient of multiple determination:

R² = SSR / TSS = (TSS - SSE) / TSS = 1 - SSE / TSS

SSR: sum of squares due to regression
SSE: sum of squares due to error
TSS: total sum of squares

The coefficient of multiple determination is the fraction of
the variation of the dependent variable explained by the
regressor variables.

R² measures the goodness of the linear fit: the better the
linear fit, the closer R² is to 1.
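A quick numerical check on the hypothetical example confirms that the two forms of R² agree:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

r2 = ssr / tss
print(np.isclose(r2, 1.0 - sse / tss))   # True: SSR/TSS = 1 - SSE/TSS
```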
Coefficient of Multiple Determination (Contd.)
The major drawback of the coefficient of multiple
determination (R²) is that adding a predictor variable to the
model always increases R², regardless of whether the
additional variable is significant or not. To avoid this,
regression model builders prefer the adjusted R² statistic:

R²adj = 1 - [ SSE / (n - p) ] / [ TSS / (n - 1) ]
      = 1 - [ (n - 1) / (n - p) ] (1 - R²)

In general, the adjusted R² statistic will not increase as
variables are added to the model.

When R² and adjusted R² differ dramatically, there is a good
chance that non-significant terms have been included in the
model.
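The drawback can be illustrated numerically: appending a regressor with arbitrary hypothetical "noise" values, unrelated to y, can never decrease R², while adjusted R² penalizes the extra parameter:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 6.0, 9.0])
X = np.column_stack([
    np.ones(5),
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
])

def r2_stats(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / tss
    r2_adj = 1.0 - (sse / (n - p)) / (tss / (n - 1))
    return r2, r2_adj

r2_small, adj_small = r2_stats(X, y)

# Append an arbitrary "noise" column unrelated to y.
noise = np.array([0.7, -1.2, 0.3, 1.5, -0.4])
X_big = np.column_stack([X, noise])
r2_big, adj_big = r2_stats(X_big, y)

print(r2_big >= r2_small)   # True: R^2 never decreases
```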