
Regression Analysis

• Existence & degree of association: Correlation
• Extent of causal relationship: Regression
• Simple linear regression model:
  – Estimated y is ŷ = a + bx
Least squares method
• If yi = a + bxi + ei,
  i.e., actual y = ŷ + error value
• Then minimize the sum of the squared errors ei:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[\, y_i - (a + b x_i) \,\right]^2$$
• Solve the following two normal equations for a and b:

$$\sum y = na + b \sum x$$
$$\sum xy = a \sum x + b \sum x^2$$
• Alternatively,

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}, \qquad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$
Coefficient of Determination

$$R^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{\left(\sum xy - \dfrac{\sum x \sum y}{n}\right)^2}{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}$$
• R² : Proportion of the variation in the values of y explained by the
regression model.
• 0 ≤ R² ≤ 1
• R² = 1 indicates the regression line is a perfect estimate of the
linear relationship between x and y.
• R² = 0 indicates no linear relationship.
Example: A sales manager intends to examine the relationship between
the constituents of a food product and consumers' preferences. He
identified a potential customer and obtained his preference ratings on a
1-9 scale for 10 different alternative products with varying protein
content.

Attempt   Consumer's preference rating (Y)   Protein content (X)
   1                    3                            4
   2                    7                            9
   3                    2                            3
   4                    1                            1
   5                    6                            3
   6                    2                            4
   7                    8                            7
   8                    3                            3
   9                    9                            8
  10                    2                            1
Protein (x)   Preference (y)    xy     x²     y²
     4              3           12     16      9
     9              7           63     81     49
     3              2            6      9      4
     1              1            1      1      1
     3              6           18      9     36
     4              2            8     16      4
     7              8           56     49     64
     3              3            9      9      9
     8              9           72     64     81
     1              2            2      1      4
Sum:  43           43          247    255    261

Sxy = 62.1,  Sxx = 70.1,  Syy = 76.1
b = 0.886,   a = 0.491,   R² = 0.723
• The normal equations: 10a + 43b = 43
                        43a + 255b = 247
• Solving gives the estimates b = 0.886 and a = 0.491.
• Regression line: ŷ = 0.491 + 0.886x
• The regression coefficient b = 0.886 indicates the change in the
consumer's preference per unit change in protein content.
• Coefficient of determination: R² = 0.723.
It implies that 72.3% of the variation in preference levels is explained
by the estimated line; the remaining 27.7% of the variation may be
explained by other variables, by errors in measurement, or both.
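
The whole calculation can be reproduced in a few lines of Python. The
following is a minimal sketch using only NumPy, with the ten (x, y)
pairs taken from the table above:

import numpy as np

# Protein content (x) and preference rating (y) from the example
x = np.array([4, 9, 3, 1, 3, 4, 7, 3, 8, 1], dtype=float)
y = np.array([3, 7, 2, 1, 6, 2, 8, 3, 9, 2], dtype=float)
n = len(x)

# Corrected sums of squares and cross-products
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # 62.1
Sxx = np.sum(x**2) - np.sum(x)**2 / n             # 70.1
Syy = np.sum(y**2) - np.sum(y)**2 / n             # 76.1

b = Sxy / Sxx                  # slope ≈ 0.886
a = y.mean() - b * x.mean()    # intercept ≈ 0.491
R2 = Sxy**2 / (Sxx * Syy)      # coefficient of determination ≈ 0.723

print(f"b = {b:.3f}, a = {a:.3f}, R² = {R2:.3f}")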
Multiple Linear Regression

More than one independent variable is used to express a single
dependent variable.

The aim is to increase the accuracy of the estimates.

It is expressed as

$$y = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + \varepsilon$$

where x1, x2, …, xk are the independent variables, y is the dependent
variable, b1, b2, …, bk are the regression coefficients to be estimated,
and ε is the error term.
Broad steps involved in developing a linear multiple regression model

1. Hypothesize the form of the model. This involves the choice of the
   independent variables to be included in the model.
2. Estimate the unknown parameters a, b1, b2, …, bk.
3. Make inferences on the estimates.


Scatterplot Matrix (SPLOM)
• A scatterplot matrix plots all possible combinations of two or more
numeric variables against one another.
• The plots are arranged in rows and columns, with the same number of
rows and columns as there are variables. The point of the plot is
simple: when you have many variables to plot against each other in
scatterplots, it is logical to arrange the plots in rows and columns,
using a common vertical scale for all plots within a row (and a common
horizontal scale within a column). All complete x-y pairs within each
plot are used; that is, pairwise deletion is used for missing data.
   Density   Machine direction   Cross direction
             strength            strength
0  0.801     121.41              70.42
1  0.824     127.70              72.47
2  0.841     129.20              78.20
3  0.816     131.80              74.89
4  0.840     135.10              71.21

import pandas as pd
import seaborn
from scipy import stats

# df holds the numeric columns shown above; standardize each column to
# z-scores and wrap the result back into a DataFrame so seaborn keeps
# the column names
dfz = pd.DataFrame(stats.zscore(df), columns=df.columns)

# Scatterplot matrix: pairwise scatterplots off the diagonal,
# kernel density estimates (KDE) on the diagonal
seaborn.pairplot(dfz, kind='scatter', diag_kind='kde', palette='deep')
LEAST SQUARES ESTIMATION OF THE PARAMETERS

The least squares function is

$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n}\left( y_i - a - \sum_{j=1}^{k} b_j x_{ij} \right)^2$$

The normal equations, written in matrix form, are

$$(\mathbf{X}'\mathbf{X})\,\hat{\mathbf{b}} = \mathbf{X}'\mathbf{y}
\quad\Rightarrow\quad
\hat{\mathbf{b}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
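
A minimal NumPy sketch of the matrix solution, using a small made-up
data set (the design matrix X carries a leading column of ones for the
intercept a):

import numpy as np

# Hypothetical data: intercept column plus two regressors
X = np.array([[1, 2, 5],
              [1, 3, 7],
              [1, 4, 6],
              [1, 5, 9],
              [1, 6, 8]], dtype=float)
y = np.array([10.0, 13.1, 13.9, 17.2, 17.8])

# Solve the normal equations (X'X) b = X'y; np.linalg.solve is
# numerically safer than forming the explicit inverse
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)    # [a, b1, b2]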
Coefficient of Determination
• R² : Proportion of the variation in the values of y explained by the
regression model.
• 0 ≤ R² ≤ 1
• R² = 1 indicates the regression model is a perfect estimate of the
linear relationship between the regressors and y.
• R² = 0 indicates no linear relationship.
Hypothesis testing in Multiple Linear Regression

I. Test for significance of regression

This test determines whether a linear relationship exists between the
response variable y and the set of regressor variables x1, x2, …, xk.

The hypotheses are

Ho : b1 = b2 = … = bk = 0
H1 : bj ≠ 0 for at least one j
ANOVA for testing significance of regression

Source of     Sum of     df       Mean Sum of          Fo        p-value
Variation     Squares             Squares
Regression    SSR        k        MSR = SSR/k          MSR/MSE
Error         SSE        n-k-1    MSE = SSE/(n-k-1)
Total         TSS        n-1

n is the number of data points in the sample.

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SSR = TSS - SSE$$

$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$
From the F table we get F(α, k, n-k-1) = Ftable.

If Fo > Ftable, then reject Ho;

OR, if p-value < α, then reject Ho.
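
The Fo statistic and its p-value can be computed directly from the
ANOVA quantities. A minimal sketch, reusing the numbers from the simple
regression example above (TSS = Syy = 76.1 and SSE = (1 - R²)·TSS ≈
21.1, with n = 10 points and k = 1 regressor):

from scipy import stats

def f_test_regression(TSS, SSE, n, k, alpha=0.05):
    """Overall F test for significance of regression."""
    SSR = TSS - SSE
    MSR = SSR / k                      # mean square for regression
    MSE = SSE / (n - k - 1)            # mean square error
    F0 = MSR / MSE
    F_table = stats.f.ppf(1 - alpha, k, n - k - 1)   # critical value
    p_value = stats.f.sf(F0, k, n - k - 1)           # upper-tail p-value
    return F0, F_table, p_value

F0, F_table, p = f_test_regression(TSS=76.1, SSE=21.1, n=10, k=1)
print(f"Fo = {F0:.2f}, Ftable = {F_table:.2f}, p = {p:.4f}")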


II. Tests on individual regression coefficients

Such tests are useful in determining the potential contribution of each
of the regressor variables in the regression model.

The model might be more effective with the inclusion of an additional
variable, or perhaps with the deletion of one or more of the regressors
present in the model.

Hypotheses:
Ho : bj = 0
H1 : bj ≠ 0

$$t_0 = \frac{\hat{b}_j}{se(\hat{b}_j)}$$

If |t0| > t(α/2, n-k-1), OR if the two-sided p-value < α, then reject Ho.
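
In statsmodels, the per-coefficient t statistics and two-sided p-values
can be read off a fitted result. A sketch, assuming `results` is a
fitted OLS results object (as in the code example at the end of this
section):

from scipy import stats

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, results.df_resid)   # t(α/2, n-k-1)

for name, t0, p in zip(results.model.exog_names,
                       results.tvalues, results.pvalues):
    # reject Ho: bj = 0 when |t0| exceeds the critical value
    # (equivalently, when the two-sided p-value is below α)
    print(f"{name}: t0 = {t0:.2f}, p = {p:.4f}, reject Ho: {abs(t0) > t_crit}")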


III. Confidence interval for the dependent variable

A 100(1-α)% CI on the dependent variable is given by

$$\hat{y} - t_{\alpha/2,\,n-k-1}\, se(\hat{y}) \le y \le \hat{y} + t_{\alpha/2,\,n-k-1}\, se(\hat{y})$$

$$se(\hat{y}) = \sqrt{MSE} = \text{standard error of estimate}$$
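
statsmodels can produce these intervals directly from a fitted model.
A minimal sketch, assuming `results` is a fitted OLS results object and
`X` is the design matrix it was fitted on:

# Interval estimates for each fitted value
pred = results.get_prediction(X)
frame = pred.summary_frame(alpha=0.05)   # 95% intervals

# mean_ci_lower / mean_ci_upper : CI for the mean response
# obs_ci_lower  / obs_ci_upper  : wider interval for an individual y
print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper']].head())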


IV. Confidence intervals for individual regression coefficients

A 100(1-α)% CI on the regression coefficient bj is given by

$$\hat{b}_j - t_{\alpha/2,\,n-k-1}\, se(\hat{b}_j) \le b_j \le \hat{b}_j + t_{\alpha/2,\,n-k-1}\, se(\hat{b}_j)$$
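
In statsmodels the same intervals are available from `conf_int()`; a
short sketch under the same assumption that `results` is a fitted OLS
result:

# 100(1-α)% confidence intervals, one row per coefficient (a, b1, ..., bk)
print(results.conf_int(alpha=0.05))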
Standardized regression coefficients

The regression model is estimated using standardized data.

Dimensionless regression coefficients can help to compare the relative
importance of each variable.

If |b̂j| > |b̂i|, then we can say that regressor xj produces a larger
effect on the response than regressor xi.
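
One way to obtain standardized coefficients is to z-score the response
and all regressors before fitting. A minimal sketch, assuming a
hypothetical DataFrame `df` whose response column is named 'y':

import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Standardize every column; z-scored data has mean 0, so no
# intercept term is needed in the standardized fit
dfz = pd.DataFrame(stats.zscore(df), columns=df.columns)
results_std = sm.OLS(dfz['y'], dfz.drop(columns='y')).fit()

# The coefficients are dimensionless; ranking |b̂j| compares the
# relative importance of the regressors
print(results_std.params.abs().sort_values(ascending=False))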
EX 1 : The data shown in the table represent the thrust of a jet-turbine engine (y) and
six candidate regressors: x1 = primary speed of rotation, x2 = secondary speed of
rotation, x3 = fuel flow rate, x4 = pressure, x5 = exhaust temperature, and x6 =
ambient temperature at time of test.

(a) Fit a multiple linear regression model with the above data and interpret the
results.
(b) Fit a multiple linear regression model using x3 = fuel flow rate, x4 = pressure,
and x5 = exhaust temperature as the regressors, and interpret the results.
(c) Refit the model using y∗ = ln(y) as the response variable and x3* = ln(x3) as
the regressor (along with x4 and x5). How does it compare with the previously
fitted regression model?
Obs y x1 x2 x3 x4 x5 x6
1 4540 2140 20640 30250 205 1732 99
2 4315 2016 20280 30010 195 1697 100
3 4095 1905 19860 29780 184 1662 97
4 3650 1675 18980 29330 164 1598 97
5 3200 1474 18100 28960 144 1541 97
6 4833 2239 20740 30083 216 1709 87
7 4617 2120 20305 29831 206 1669 87
8 4340 1990 19961 29604 196 1640 87
9 3820 1702 18916 29088 171 1572 85
10 3368 1487 18012 28675 149 1522 85
11 4445 2107 20520 30120 195 1740 101
12 4188 1973 20130 29920 190 1711 100
13 3981 1864 19780 29720 180 1682 100
14 3622 1674 19020 29370 161 1630 100
15 3125 1440 18030 28940 139 1572 101
16 4560 2165 20680 30160 208 1704 98
17 4340 2048 20340 29960 199 1679 96
18 4115 1916 19860 29710 187 1642 94
19 3630 1658 18950 29250 164 1576 94
20 3210 1489 18700 28890 145 1528 94
21 4330 2062 20500 30190 193 1748 101
22 4119 1929 20050 29960 183 1713 100
23 3891 1815 19680 29770 173 1684 100
24 3467 1595 18890 29360 153 1624 99
25 3045 1400 17870 28960 134 1569 100
26 4411 2047 20540 30160 193 1746 99
27 4203 1935 20160 29940 184 1714 99
28 3968 1807 19750 29760 173 1679 99
29 3531 1591 18890 29350 153 1621 99
30 3074 1388 17870 28910 133 1561 99
31 4350 2071 20460 30180 198 1729 102
32 4128 1944 20010 29940 186 1692 101
33 3940 1831 19640 29750 178 1667 101
34 3480 1612 18710 29360 156 1609 101
35 3064 1410 17780 28900 136 1552 101
36 4402 2066 20520 30170 197 1758 100
37 4180 1954 20150 29950 188 1729 99
38 3973 1835 19750 29740 178 1690 99
39 3530 1616 18850 29320 156 1616 99
40 3080 1407 17910 28910 137 1569 100
EX 2
import pandas as pd
import statsmodels.api as sm
import numpy as np

# Load the data set shown above (path elided in the source)
df = pd.read_csv("C:/Users/ … /….csv")

# PROVIDING DATA
# Columns: Obs, y, x1, ..., x6 -> regressors start at column index 2
X = df.iloc[:, 2:].copy()
y = df['y'].copy()

X, y = np.array(X), np.array(y)
X = sm.add_constant(X)          # prepend a column of ones for the intercept

model1 = sm.OLS(y, X)           # ordinary least squares
results1 = model1.fit()
print(results1.summary())       # coefficients, t tests, F test, R², ...

print("\nFitted Values:\n")
y_pred = results1.fittedvalues.round(2)
print(y_pred)
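
Parts (b) and (c) of EX 1 follow the same pattern; here is a sketch
assuming `df` is the DataFrame loaded above, with columns named y,
x1, …, x6:

# (b) Refit with only x3 (fuel flow rate), x4 (pressure), x5 (exhaust temp.)
Xb = sm.add_constant(df[['x3', 'x4', 'x5']])
results2 = sm.OLS(df['y'], Xb).fit()
print(results2.summary())

# (c) Log-transform the response and x3; keep x4 and x5 unchanged
Xc = df[['x4', 'x5']].copy()
Xc['ln_x3'] = np.log(df['x3'])
Xc = sm.add_constant(Xc)
results3 = sm.OLS(np.log(df['y']), Xc).fit()
print(results3.summary())   # compare R² and residual behaviour with (a), (b)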
