UNIT IV
Regression analysis is a statistical technique for investigating and
modeling the relationship between variables.
The simple linear regression model is given by the equation
y = β0 + β1x + ε → (1)
where x is called the predictor or regressor variable and y is called the response
variable. The quantity ε is called the error, which is the difference between the
observed value and the estimated value. Note that y = β0 + β1x is the equation of
the least squares straight line connecting the variables x and y, where β0 is the
intercept and β1 is the slope.
Suppose that we fix the value of the regressor variable x and observe the
corresponding value of the response y. Then the conditional mean of y given x is
E(y/x) = µy/x = E(β0 + β1x + ε) = β0 + β1x
Similarly,
Var(y/x) = σ²y/x = Var(β0 + β1x + ε) = σ²
In general, the response variable y may be related to k regressors x1, x2, …, xk,
so that
y = β0 + β1x1 + β2x2 + … + βkxk + ε → (2)
This is called a multiple linear regression model, as more than one regressor is
involved.
Regression models are used for several purposes, among them:
(i) Data description
(ii) Parameter estimation
(iii) Prediction and estimation
(iv) Control etc.
Simple linear regression model
The simple linear regression model is a model with a single regressor x
that has a straight-line relationship with a response y. This model is given by
y = β0 + β1x + ε → (1)
where β0 is the intercept, β1 is the slope and ε is a random error component. The
errors are assumed to be normally distributed with mean zero and variance σ².
Clearly, E(y/x) = β0 + β1x → (2)
Var(y/x) = Var(β0 + β1x + ε) = σ² → (3)
The parameters β0 and β1 are called regression coefficients.
Least Squares Estimation of the parameters:
We will use the method of least squares to estimate the parameters β0 and
β1 in (1). That is, we will estimate β0 and β1 so that the sum of squares of the
differences between the observations yi and the straight line is a minimum.
From (1), we have
yi = β0 + β1xi + εi , i = 1, 2, …, n → (4)
Equation (1) may be viewed as a population regression model, whereas (4) is a
sample regression model written in terms of the n pairs of data (xi, yi),
i = 1, 2, …, n.
According to the least squares criterion,
S(β0, β1) = Σ (i = 1 to n) (yi − β0 − β1xi)²
is to be minimized.
Setting the partial derivatives of S to zero, we obtain the following equations:
∂S/∂β0 = −2 Σ (i = 1 to n) (yi − β̂0 − β̂1xi) = 0
∂S/∂β1 = −2 Σ (i = 1 to n) (yi − β̂0 − β̂1xi) xi = 0
Simplifying the above equations, we get
nβ̂0 + β̂1 Σ xi = Σ yi
β̂0 Σ xi + β̂1 Σ xi² = Σ xi yi → (5)
Solving (5), we get β̂0 and β̂1, which are called the least squares estimators of
β0 and β1. The equations (5) are called the least squares normal equations.
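Since the normal equations form a 2×2 linear system in β̂0 and β̂1, they can be solved directly. A minimal sketch in Python (NumPy assumed; the small data set here is invented purely for illustration):

```python
import numpy as np

# A small illustrative data set (not from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
# Normal equations (5):
#   n*b0      + b1*sum(x)   = sum(y)
#   b0*sum(x) + b1*sum(x^2) = sum(x*y)
A = np.array([[n,       x.sum()],
              [x.sum(), np.sum(x**2)]])
rhs = np.array([y.sum(), np.sum(x*y)])

b0, b1 = np.linalg.solve(A, rhs)  # least squares estimates
print(b0, b1)
```

Solving the system this way gives exactly the same β̂0 and β̂1 as the closed-form expressions derived next.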
Since (x̄, ȳ) lies on the least squares line, we have ȳ = β̂0 + β̂1x̄, so that
β̂0 = ȳ − β̂1x̄ → (6)
and
β̂1 = [Σ xi yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n] → (7)
Equation (6) is obtained from the first equation of (5) after dividing by n,
where ȳ = (1/n) Σ yi and x̄ = (1/n) Σ xi.
The fitted simple linear regression model is given by
ŷ = β̂0 + β̂1x → (8)
Equation (8) gives a point estimate of the mean of y for a particular x.
Let us denote
Sxx = Σ xi² − (Σ xi)²/n = Σ (i = 1 to n) (xi − x̄)² → (9)
and
Sxy = Σ xi yi − (Σ xi)(Σ yi)/n = Σ (i = 1 to n) yi (xi − x̄) → (10)
Now equation (7) can be written as
β̂1 = Sxy / Sxx → (11)
The difference between the observed value yi and the corresponding fitted
value ŷi is called the residual. That is,
ei = yi − ŷi = yi − (β̂0 + β̂1xi), i = 1, 2, …, n → (12)
Problem: A rocket motor is manufactured by bonding an igniter propellant and
a sustainer propellant together inside a metal housing. It is suspected that the
shear strength is related to the age (in weeks) of the batch of propellant, and
data have been collected. Fit a least squares regression model to the following data.
Observation Shear strength (psi) Age of propellant(weeks)
i yi xi
1 2158.70 15.50
2 1678.15 23.75
3 2316.00 8.00
4 2061.30 17.00
5 2207.50 5.50
6 1708.30 19.00
7 1784.70 24.00
8 2575.00 2.50
Solution: n = 8, Σ xi = 115.25, Σ yi = 16489.65, x̄ = 14.4063, ȳ = 2061.2063
Σ xi² = 2130.8125, Σ xi yi = 220755.2625
Sxx = 470.49218, Sxy = −16798.7578
Now, β̂1 = Sxy/Sxx = −16798.7578/470.49218 = −35.70464
β̂0 = ȳ − β̂1x̄
 = 2061.2063 − (−35.70464)(14.4063)
 = 2575.578
The least squares line is
ŷ = β̂0 + β̂1x = 2575.578 − 35.70464x
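The hand computation can be cross-checked with a short script. A sketch in Python (NumPy assumed; the variable names are illustrative):

```python
import numpy as np

# Rocket propellant data from the problem: age (weeks) and shear strength (psi)
x = np.array([15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50])
y = np.array([2158.70, 1678.15, 2316.00, 2061.30, 2207.50,
              1708.30, 1784.70, 2575.00])

n = len(x)
Sxx = np.sum(x**2) - x.sum()**2 / n       # corrected sum of squares of x
Sxy = np.sum(x*y) - x.sum()*y.sum() / n   # corrected sum of cross products

b1 = Sxy / Sxx                # slope estimate, eq. (11)
b0 = y.mean() - b1*x.mean()   # intercept estimate, eq. (6)
print(b1, b0)
```

This reproduces β̂1 ≈ −35.70 and β̂0 ≈ 2575.58, agreeing with the hand calculation up to rounding.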
Properties of the least squares estimators and the fitted regression model :
The least squares estimators β̂0 and β̂1 have several important properties.
(1) β̂0 and β̂1 are linear combinations of the observations yi. For example,
β̂1 = Sxy/Sxx
 = [Σ (i = 1 to n) yi (xi − x̄)] / [Σ (i = 1 to n) (xi − x̄)²]
 = Σ ci yi , where ci = (xi − x̄)/Σ (xi − x̄)² = (xi − x̄)/Sxx
(2) The least squares estimators β̂0 and β̂1 are unbiased estimators of the model
parameters β0 and β1,
i.e., E(β̂0) = β0 and E(β̂1) = β1.
Their variances are
Var(β̂0) = σ²(1/n + x̄²/Sxx)
Var(β̂1) = σ²/Sxx
where β̂0 = ȳ − β̂1x̄ and β̂1 = Sxy/Sxx.
(3) The sum of the residuals in any regression model that contains an
intercept β0 is always zero,
i.e., Σ (i = 1 to n) (yi − ŷi) = Σ ei = 0.
Consider
Σ (yi − ŷi) = Σ (yi − β̂0 − β̂1xi)
 = Σ yi − nβ̂0 − β̂1 Σ xi
 = nȳ − nβ̂0 − nβ̂1x̄
 = nȳ − n(ȳ − β̂1x̄) − nβ̂1x̄
 = nȳ − nȳ + nβ̂1x̄ − nβ̂1x̄
 = 0
(4) The sum of the observed values yi equals the sum of the fitted values ŷi,
i.e., Σ (i = 1 to n) yi = Σ (i = 1 to n) ŷi.
(5) The least squares regression line always passes through the point (x̄, ȳ).
(6) The sum of the residuals weighted by the corresponding value of the
regressor variable always equals zero.
That is, Σ (i = 1 to n) xi ei = 0.
(7) The sum of the residuals weighted by the corresponding fitted value
always equals zero.
That is, Σ (i = 1 to n) ŷi ei = 0.
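Properties (3), (4), (6) and (7) are easy to verify numerically. A sketch in Python (NumPy assumed), reusing the rocket propellant data from the earlier problem:

```python
import numpy as np

# Rocket propellant data reused from the earlier fitting problem
x = np.array([15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50])
y = np.array([2158.70, 1678.15, 2316.00, 2061.30, 2207.50,
              1708.30, 1784.70, 2575.00])

n = len(x)
b1 = (np.sum(x*y) - x.sum()*y.sum()/n) / (np.sum(x**2) - x.sum()**2/n)
b0 = y.mean() - b1*x.mean()

y_hat = b0 + b1*x   # fitted values
e = y - y_hat       # residuals

print(e.sum())                  # property (3): sum of residuals ~ 0
print(y.sum(), y_hat.sum())     # property (4): the two sums agree
print((x * e).sum())            # property (6): x-weighted residual sum ~ 0
print((y_hat * e).sum())        # property (7): fit-weighted residual sum ~ 0
```

All four quantities come out as zero up to floating-point round-off.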
Estimation of σ²:
The estimator of σ² is obtained from the residual sum of squares, denoted
SSRes, as follows.
Assuming that yi is normally distributed, it follows that SSRes/σ² has a χ²
distribution with n − 2 degrees of freedom, so
E(SSRes/σ²) = n − 2
i.e., E(SSRes) = (n − 2)σ²
⟹ E(SSRes/(n − 2)) = σ²
Thus an unbiased estimator of σ² is
σ̂² = SSRes/(n − 2) = MSRes
(where MSRes is the mean square of residuals, or residual mean square).
Note that
SSRes = Σ ei² = Σ (yi − ŷi)²
 = Σ (yi − β̂0 − β̂1xi)²
 = Σ (yi − ȳ + β̂1x̄ − β̂1xi)²        [since β̂0 = ȳ − β̂1x̄]
 = Σ [(yi − ȳ) − β̂1(xi − x̄)]²
 = Σ (yi − ȳ)² + β̂1² Σ (xi − x̄)² − 2β̂1 Σ (xi − x̄)(yi − ȳ)
 = Syy + β̂1²Sxx − 2β̂1Sxy
 = Syy + β̂1(β̂1Sxx) − 2β̂1Sxy
 = Syy + β̂1Sxy − 2β̂1Sxy        [since β̂1 = Sxy/Sxx ⟹ β̂1Sxx = Sxy]
 = Syy − β̂1Sxy
 = Syy − Sxy²/Sxx
Then SSRes = SST − β̂1Sxy
(where we denote SST = Syy = Σ (i = 1 to n) (yi − ȳ)²).
Problem: With reference to the previous problem, find the estimate of σ².
Hypothesis testing on the slope and intercept:
Hypothesis testing on the slope (σ² known):
Suppose that we wish to test the hypothesis that the slope equals a constant,
say β10. The appropriate hypotheses are
Null hypothesis H0: β1 = β10
Alternative hypothesis H1: β1 ≠ β10
Test statistic: Z0 = (β̂1 − β10)/√(σ²/Sxx)
Decision: Reject H0 if |Z0| > Z(α/2).
Also the (1 − α)100% confidence interval for β1 is
β̂1 − Z(α/2)√(σ²/Sxx) < β1 < β̂1 + Z(α/2)√(σ²/Sxx)
Hypothesis testing on the slope (σ² unknown):
Suppose that we wish to test the hypothesis that the slope equals a constant,
say β10. The appropriate hypotheses are
Null hypothesis H0: β1 = β10
Alternative hypothesis H1: β1 ≠ β10, where β10 is a specified constant
Test statistic: t0 = (β̂1 − β10)/√(MSRes/Sxx) = (β̂1 − β10)/√(σ̂²/Sxx)
Decision: Reject H0 if |t0| > t(α/2) with (n − 2) degrees of freedom.
Also the (1 − α)100% confidence interval for β1 is
β̂1 − t(α/2)(n − 2 d.f.)√(SSRes/((n − 2)Sxx)) ≤ β1 ≤ β̂1 + t(α/2)(n − 2 d.f.)√(SSRes/((n − 2)Sxx))
Note: The denominator of the test statistic t0 is called the estimated
standard error (or standard error) of the slope and is denoted by Se(β̂1):
Se(β̂1) = √(MSRes/Sxx)
So we can also write the statistic t0 as
t0 = (β̂1 − β10)/Se(β̂1)
Problem: The following are measurements of the air velocity and evaporation
coefficient of burning fuel droplets in an impulse engine:
Air velocity (cm/sec) : 20 60 100 140 180 220 260 300 340 380
Evap. coeff. (mm²/sec): 0.18 0.37 0.35 0.78 0.56 0.75 1.18 1.36 1.17 1.65
(i) Fit a simple linear regression model to the above data.
(ii) Test the null hypothesis β1 = 0 against the alternative hypothesis β1 ≠ 0 at
the 0.05 level of significance.
(iii) Construct a 95% confidence interval for the slope parameter β1.
Solution: (i) Let ŷ = β̂0 + β̂1x be the fitted simple linear regression model.
n = 10, Σ xi = 2000, Σ xi² = 532000, Σ yi = 8.35, Σ xi yi = 2175.40, Σ yi² = 9.1097
Sxx = Σ xi² − (Σ xi)²/10 = 532000 − (2000)²/10 = 132000
Sxy = Σ xi yi − (Σ xi)(Σ yi)/10 = 2175.40 − (2000)(8.35)/10 = 505.40
Syy = Σ yi² − (Σ yi)²/10 = 9.1097 − (8.35)²/10 = 2.13745
Now, β̂1 = Sxy/Sxx = 505.40/132000 = 0.00383
β̂0 = ȳ − β̂1x̄
 = 8.35/10 − 0.00383 × (2000/10)
 = 0.069
ŷ = 0.069 + 0.00383x
(ii) Null hypothesis H0: β1 = 0
Alternative hypothesis H1: β1 ≠ 0
Since σ² is unknown, we use the t statistic:
Test statistic: t0 = (β̂1 − β10)/√(MSRes/Sxx), where MSRes = SSRes/(n − 2)
SSRes = Syy − Sxy²/Sxx = 2.13745 − (505.40)²/132000 = 0.20238
MSRes = 0.20238/8 = 0.0252975
t0 = 0.00383/√(0.0252975/132000)
t0 = 8.7488
t(α/2)(n − 2 d.f.) = t0.025(8 d.f.) = 2.306
Decision: Reject H0 if |t0| > t0.025(8 d.f.).
Since t0 = 8.7488 exceeds t0.025(8 d.f.) = 2.306, we reject the null
hypothesis and accept the alternative hypothesis H1: β1 ≠ 0.
(iii) The (1 − α)100% confidence limits for β1 are
β̂1 ± t(α/2)(n − 2 d.f.)√(SSRes/((n − 2)Sxx))
= β̂1 ± t(α/2)(n − 2 d.f.)√(MSRes/Sxx)
Here (1 − α)100 = 95, so α = 0.05 and α/2 = 0.025
t0.025(8 d.f.) = 2.306
MSRes = 0.0252975, Sxx = 132000
√(MSRes/Sxx) = √(0.0252975/132000) = 0.0004378
0.00383 ± (2.306)(0.0004378)
0.00282 ≤ β1 ≤ 0.00484
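The slope test and confidence interval can be reproduced numerically. A sketch in Python (NumPy assumed; the critical value t0.025(8 d.f.) = 2.306 is taken from tables, as in the text):

```python
import numpy as np

# Air velocity (cm/sec) and evaporation coefficient (mm^2/sec)
x = np.array([20, 60, 100, 140, 180, 220, 260, 300, 340, 380], dtype=float)
y = np.array([0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65])

n = len(x)
Sxx = np.sum(x**2) - x.sum()**2 / n
Sxy = np.sum(x*y) - x.sum()*y.sum() / n
Syy = np.sum(y**2) - y.sum()**2 / n

b1 = Sxy / Sxx
ss_res = Syy - Sxy**2 / Sxx
ms_res = ss_res / (n - 2)
se_b1 = np.sqrt(ms_res / Sxx)    # standard error of the slope

t0 = b1 / se_b1                  # test statistic for H0: beta1 = 0
t_crit = 2.306                   # t_{0.025} with 8 d.f. (from tables)
lo, hi = b1 - t_crit*se_b1, b1 + t_crit*se_b1
print(t0, lo, hi)
```

This gives t0 ≈ 8.75 (reject H0: β1 = 0) and the 95% confidence interval 0.00282 ≤ β1 ≤ 0.00484.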
Hypothesis testing on the intercept (σ² known):
Null hypothesis H0: β0 = β00
Alternative hypothesis H1: β0 ≠ β00, where β00 is a specified constant
Sample size = n
Level of significance = α
Test statistic: Z0 = (β̂0 − β00)/√(σ²(1/n + x̄²/Sxx))
Decision: Reject the null hypothesis if |Z0| > Z(α/2).
Also the (1 − α)100% confidence interval for β0 is
β̂0 ± Z(α/2)√(σ²(1/n + x̄²/Sxx)), i.e.,
β̂0 − Z(α/2)√(σ²(1/n + x̄²/Sxx)) ≤ β0 ≤ β̂0 + Z(α/2)√(σ²(1/n + x̄²/Sxx))
Hypothesis testing on the intercept (σ² unknown):
Null hypothesis H0: β0 = β00
Alternative hypothesis H1: β0 ≠ β00, where β00 is a specified constant
Sample size = n
Level of significance = α
Test statistic: t0 = (β̂0 − β00)/√(MSRes(1/n + x̄²/Sxx))
Decision: Reject the null hypothesis if |t0| > t(α/2)(n − 2 d.f.).
Also the (1 − α)100% confidence interval for β0 is
β̂0 − t(α/2)(n − 2 d.f.)√(MSRes(1/n + x̄²/Sxx)) ≤ β0 ≤ β̂0 + t(α/2)(n − 2 d.f.)√(MSRes(1/n + x̄²/Sxx))
Problem: The following data pertain to the number of computer jobs per day
and the central processing unit time required.
No.of jobs (x) 1 2 3 4 5
CPU Time(y) 2 5 4 9 10
(i) Fit a least squares line ŷ = β̂0 + β̂1x.
(ii) Predict the mean CPU time when x = 3.5.
(iii) Test the null hypothesis H0: β0 = 0.002 against the alternative
hypothesis H1: β0 ≠ 0.002 at the 5% level of significance.
(iv) Construct a 95% confidence interval for β0.
Solution: (i) Σ xi = 15, Σ yi = 30, Σ xi² = 55, Σ yi² = 226, Σ xi yi = 110, x̄ = 3 and ȳ = 6
Sxx = 10, Syy = 46, Sxy = 20
β̂1 = Sxy/Sxx = 20/10 = 2
β̂0 = ȳ − β̂1x̄
 = 6 − (2)(3)
 = 0
The least squares line is ŷ = 2x
(ii) When x = 3.5, ŷ = 2(3.5) = 7
(iii) H0: β0 = 0.002 (here σ² is unknown)
H1: β0 ≠ 0.002
n = 5, α = 0.05
SSRes = Syy − Sxy²/Sxx = 46 − (20)²/10 = 6
MSRes = SSRes/(n − 2) = 6/3 = 2
Test statistic: t0 = (β̂0 − β00)/√(MSRes(1/n + x̄²/Sxx))
 = (0 − 0.002)/√(2(1/5 + 9/10))
 = −0.002/√2.2
 = −0.00135
t(α/2)(n − 2 d.f.) = t0.025(3 d.f.) = 3.182
Decision: Reject H0 if |t0| > t(α/2)(n − 2 d.f.).
Since |t0| = 0.00135 < t0.025(3 d.f.) = 3.182,
we accept the null hypothesis H0: β0 = 0.002.
(iv) Here (1 − α)100 = 95, so α = 0.05 and α/2 = 0.025.
The (1 − α)100% confidence interval for β0 is
β̂0 − t(α/2)(n − 2 d.f.)√(MSRes(1/n + x̄²/Sxx)) ≤ β0 ≤ β̂0 + t(α/2)(n − 2 d.f.)√(MSRes(1/n + x̄²/Sxx))
0 − 3.182√(2(1/5 + 9/10)) ≤ β0 ≤ 0 + 3.182√(2(1/5 + 9/10))
−4.7196 ≤ β0 ≤ 4.7196
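The intercept test and interval for the CPU-time data can be checked in the same way. A sketch in Python (NumPy assumed; the tabulated value t0.025(3 d.f.) = 3.182 is hard-coded):

```python
import numpy as np

# CPU-time data: number of jobs per day (x) and CPU time (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 5, 4, 9, 10], dtype=float)

n = len(x)
Sxx = np.sum(x**2) - x.sum()**2 / n
Sxy = np.sum(x*y) - x.sum()*y.sum() / n
Syy = np.sum(y**2) - y.sum()**2 / n

b1 = Sxy / Sxx
b0 = y.mean() - b1*x.mean()
ms_res = (Syy - Sxy**2/Sxx) / (n - 2)
se_b0 = np.sqrt(ms_res * (1/n + x.mean()**2 / Sxx))  # std. error of intercept

t0 = (b0 - 0.002) / se_b0        # test H0: beta0 = 0.002
t_crit = 3.182                   # t_{0.025} with 3 d.f. (from tables)
lo, hi = b0 - t_crit*se_b0, b0 + t_crit*se_b0
print(t0, lo, hi)
```

Since |t0| is far below 3.182, H0 is accepted, and the 95% interval for β0 is approximately (−4.72, 4.72).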
Hypothesis Testing on the Slope β1 = 0 – A Special Case
Null hypothesis H0: β1 = 0
Alternative hypothesis H1: β1 ≠ 0
Sample size = n
Level of significance = α
Test statistic: t0 = β̂1/Se(β̂1)
where Se(β̂1) = √(MSRes/Sxx)
Decision: Reject the null hypothesis if |t0| > t(α/2)(n − 2 d.f.).
Problem: The following are measurements of the air velocity and evaporation
coefficient of burning fuel droplets in an impulse engine.
Air velocity (cm/sec): 20 60 100 140 180 220 260 300 340 380
Evap. coeff. (mm²/sec): 0.18 0.37 0.35 0.78 0.56 0.75 1.18 1.36 1.17 1.65
Test the null hypothesis β1 = 0 against the alternative hypothesis β1 ≠ 0 at the
0.05 level of significance.
Solution:
Null hypothesis H0: β1 = 0
Alternative hypothesis H1: β1 ≠ 0
Sample size n = 10
Level of significance α = 0.05
Test statistic: t0 = β̂1/Se(β̂1)
where Se(β̂1) = √(MSRes/Sxx)
From the earlier problem, Sxx = 132000, Sxy = 505.40, Syy = 2.13745, β̂1 = 0.00383
SSRes = Syy − Sxy²/Sxx = 0.20238
MSRes = SSRes/(n − 2) = 0.02529
t0 = 0.00383/√(0.02529/132000)
 = 8.75
t0.025(8 d.f.) = 2.306
Decision: Reject the null hypothesis if |t0| > t(α/2)(n − 2 d.f.).
Clearly t0 = 8.75 > t0.025(8 d.f.) = 2.306,
so we reject the null hypothesis.