Chapter 4 (Regression Part)
CHAPTER FIVE
REGRESSION ANALYSIS
Regression is the functional relationship between two variables, one of which may represent
a cause and the other an effect. The variable representing the cause is known as the
independent variable and is denoted by X. The variable X is also known as the predictor variable
or regressor. The variable representing the effect is known as the dependent variable and is
denoted by Y. Y is also known as the predicted variable. The term "regression" was used by the
famous biometrician Sir Francis Galton (1822-1911) in 1877.
Assumptions
2. At each fixed value of X the corresponding values of Y have a normal distribution about a
mean.
i. To estimate the relationship that exists, on average, between the dependent variable and
the independent variables.
ii. To determine the effect of each independent variable on the dependent variable, controlling
for the effects of the other independent variables.
Dr. Manju, Associate Professor, CSE , IIUC
iii. To predict the value of the dependent variable for given values of the explanatory variables.
The simplest form of the regression model that displays the relation between X and Y is a
straight line, which appears as follows: $\hat{Y} = a + bX$
Here $\hat{Y}$ denotes the predicted value of Y, a is the intercept and b is the slope of the
straight line. In regression terminology, b is the regression coefficient of Y on X. This straight
line is called the fitted line of Y.
The least-squares method is a technique for minimizing the sum of the squares of the differences
between the observed values and estimated values of the dependent variable. That is, the
least-squares line is the line that minimizes $\sum e_i^2 = \sum (Y_i - bX_i - a)^2$
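To minimize $SSE = \sum e_i^2$ with respect to a and b, we know from calculus that the partial
derivatives of SSE with respect to a and b must be 0. Then
$$\frac{\partial \sum e_i^2}{\partial a} = -2\sum (Y_i - bX_i - a) = 0$$
$$\frac{\partial \sum e_i^2}{\partial b} = -2\sum (Y_i - bX_i - a)X_i = 0$$
Solving these two equations for a and b gives
$$b = b_{yx} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
and $a = \bar{Y} - b\bar{X}$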
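The closed-form solution above can be checked numerically. The following is a minimal sketch in plain Python, not part of the original notes: the function names `fit_line` and `sse` and the five data points are hypothetical, chosen only for illustration. It computes a and b from the deviation formula and confirms that both normal equations vanish at the solution.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sse(xs, ys, a, b):
    """Sum of squared errors for the line y = a + b*x."""
    return sum((y - b * x - a) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
residuals = [y - b * x - a for x, y in zip(xs, ys)]

print("a =", round(a, 4), "b =", round(b, 4))
# Both normal equations should be numerically zero at the solution.
print("sum of e_i       =", round(sum(residuals), 10))
print("sum of e_i * x_i =", round(sum(e * x for e, x in zip(residuals, xs)), 10))
# Moving away from (a, b) should not reduce the SSE.
print(sse(xs, ys, a, b) <= sse(xs, ys, a + 0.1, b))
print(sse(xs, ys, a, b) <= sse(xs, ys, a, b - 0.1))
```

Perturbing a or b in either direction increases the SSE, which is what minimizing the sum of squared errors means in practice.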
Regression coefficient
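Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be n pairs of observations. Then the regression
coefficient of y on x is denoted by $b_{yx}$ and defined by
$$b_{yx} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Similarly, the regression coefficient of x on y is denoted by $b_{xy}$ and defined by
$$b_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$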
Regression lines:
If we consider two variables X and Y, we shall have two regression lines as the regression line of
Y on X and the regression line of X on Y. The regression line of Y on X gives the most probable
values of Y for given values of X, and the regression line of X on Y gives the most probable
values of X for given values of Y. Thus we have two regression lines. However, when there is
either perfect positive or perfect negative correlation between the two variables, the two
regression lines coincide, i.e., we have only one line.
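The claim that the two lines coincide under perfect correlation can be illustrated with a small sketch. The helper `least_squares` and the data (points lying exactly on y = 2x + 1) are hypothetical and chosen only for illustration. The line of X on Y is rewritten in the form y = intercept + slope * x so that it can be compared directly with the line of Y on X.

```python
def least_squares(us, vs):
    """Return (a, b) for the least-squares line v = a + b*u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    b = s_uv / s_uu
    return v_bar - b * u_bar, b


# Hypothetical data with perfect positive correlation: y = 2x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [2 * x + 1 for x in xs]

a_yx, b_yx = least_squares(xs, ys)  # line of Y on X: y = a_yx + b_yx * x
a_xy, b_xy = least_squares(ys, xs)  # line of X on Y: x = a_xy + b_xy * y

# Solve x = a_xy + b_xy * y for y to compare the two lines in the same form.
slope_from_xy = 1 / b_xy
intercept_from_xy = -a_xy / b_xy

print("Y on X:", a_yx, b_yx)
print("X on Y, solved for y:", intercept_from_xy, slope_from_xy)
print("b_yx * b_xy =", b_yx * b_xy)
```

With perfectly correlated data the product of the two regression coefficients equals 1, and both lines have the same slope and intercept. For data that are not perfectly correlated, the two lines differ and intersect at the point of means.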
Regression equation:
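y = a + bx, where y is the dependent variable to be estimated, x is the independent variable, a
is the intercept term (the value of y when x = 0) and b is the slope of the line.
Here,
$$a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$
and
$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$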
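x = a + by, where x is the dependent variable to be estimated, y is the independent variable, a
is the intercept term (the value of x when y = 0) and b is the slope of the line.
Here,
$$a = \bar{x} - b\bar{y} = \frac{\sum x}{n} - b\,\frac{\sum y}{n}$$
and
$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}$$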
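Each slope above is written in two ways: a deviation form and a computational form that needs only the column totals. The sketch below, with hypothetical function names and illustrative data that are not from the notes, computes both regression coefficients each way to show that the two forms agree.

```python
def coefficients_by_deviations(xs, ys):
    """(b_yx, b_xy) from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / s_xx, s_xy / s_yy


def coefficients_by_sums(xs, ys):
    """(b_yx, b_xy) from the column totals only (computational form)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    return (numerator / (sum_x2 - sum_x ** 2 / n),
            numerator / (sum_y2 - sum_y ** 2 / n))


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

dev_b_yx, dev_b_xy = coefficients_by_deviations(xs, ys)
sum_b_yx, sum_b_xy = coefficients_by_sums(xs, ys)
print("deviation form:    ", dev_b_yx, dev_b_xy)
print("computational form:", sum_b_yx, sum_b_xy)
```

The computational form is the one suited to a hand-built table of x, y, x², y² and xy, since it avoids subtracting the means from every observation.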
coefficient, i.e., $\frac{b_{yx} + b_{xy}}{2} \ge r_{xy}$
6. If one of the regression coefficients is greater than unity, the other must be less than unity.
Coefficient of Determination $r^2$:
The coefficient of determination, $r^2$, is useful because it gives the proportion of the variance
(fluctuation) of one variable that is predictable from the other variable. It is a measure that allows
us to determine how certain one can be in making predictions from a certain model/graph. The
coefficient of determination is the ratio of the explained variation to the total variation.
The coefficient of determination is such that $0 \le r^2 \le 1$, and denotes the strength of the
linear association between x and y.
Expressed as a percentage, the coefficient of determination gives the share of the variation in y
that is accounted for by the line of best fit. For example, if r = 0.922, then $r^2$ = 0.850, which
means that 85% of the total variation in y can be explained by the linear relationship between x
and y (as described by the regression equation). The other 15% of the total variation in y remains
unexplained.
The coefficient of determination is a measure of how well the regression line represents the
data. If the regression line passes exactly through every point on the scatter plot, it would be
able to explain all of the variation. The further the line is away from the points, the less it is able
to explain.
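The ratio of explained to total variation can be computed directly from a fitted line. The sketch below is an illustration under stated assumptions: the data are hypothetical and the variable names are not from the notes. It fits the line of y on x, splits the total variation into explained and unexplained parts, and compares the ratio with the square of the correlation coefficient.

```python
import math

# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Least-squares line of y on x and its fitted values.
b = s_xy / s_xx
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

total = s_yy                                            # total variation in y
explained = sum((f - y_bar) ** 2 for f in fitted)       # explained by the line
unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))

r_squared = explained / total
r = s_xy / math.sqrt(s_xx * s_yy)

print("explained + unexplained =", explained + unexplained, "total =", total)
print("r^2 from variation ratio =", r_squared)
print("r^2 from correlation     =", r ** 2)
```

For these points the line explains 60% of the variation in y, and the explained and unexplained parts add up to the total variation.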
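Theorem: Show that the correlation coefficient is the geometric mean of the regression
coefficients, i.e., $r_{xy} = \sqrt{b_{yx}\, b_{xy}}$
Proof: Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be n pairs of observations. Then the
correlation coefficient between x and y is denoted by $r_{xy}$ and defined as,
$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (1)$$
Again, the regression coefficient of y on x is,
$$b_{yx} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
and the regression coefficient of x on y is,
$$b_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
Multiplying the two,
$$b_{yx}\, b_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \cdot \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}$$
Taking the square root,
$$\sqrt{b_{yx}\, b_{xy}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} = r_{xy} \quad \text{(proved)}$$
The square root is taken with the sign shared by $b_{yx}$ and $b_{xy}$, so $r_{xy}$ is negative
when both regression coefficients are negative.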
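Theorem: The arithmetic mean of the two regression coefficients is greater than or equal to the
correlation coefficient, i.e., $\frac{b_{yx} + b_{xy}}{2} \ge r_{xy}$
Proof: Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be n pairs of observations. Then the
regression coefficient of y on x is denoted by $b_{yx}$ and the regression coefficient of x on y is
denoted by $b_{xy}$.
The arithmetic mean of $b_{yx}$ and $b_{xy}$ is $\frac{b_{yx} + b_{xy}}{2}$ and the geometric mean
is $\sqrt{b_{yx}\, b_{xy}}$.
We know that the arithmetic mean of two positive numbers is greater than or equal to their
geometric mean, so
$$\frac{b_{yx} + b_{xy}}{2} \ge \sqrt{b_{yx}\, b_{xy}}$$
or, $\frac{b_{yx} + b_{xy}}{2} \ge r_{xy}$ (proved)
When both regression coefficients are negative, the same argument applies to their absolute
values, giving $\left|\frac{b_{yx} + b_{xy}}{2}\right| \ge |r_{xy}|$.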
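Both theorems can be checked numerically. The sketch below uses a small hypothetical data set with a negative slope, so that the sign convention matters; none of the names or values come from the notes. It compares r with the signed geometric mean of the two regression coefficients, and compares the arithmetic mean with r in absolute value.

```python
import math

# Hypothetical data with a negative relationship, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [9, 7, 6, 4, 2]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

b_yx = s_xy / s_xx
b_xy = s_xy / s_yy
r = s_xy / math.sqrt(s_xx * s_yy)

# Geometric mean, given the sign shared by the two regression coefficients.
geometric_mean = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
arithmetic_mean = (b_yx + b_xy) / 2

print("r                =", r)
print("signed geo. mean =", geometric_mean)
print("|arith. mean| >= |r| :", abs(arithmetic_mean) >= abs(r))
```

Because both coefficients are negative here, the arithmetic mean is itself negative and lies below r, which is why the comparison is made on absolute values.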
Uses of regression:
Application problem-1: A researcher wants to find out if there is any relationship between the
ages of husbands and the ages of wives. In other words, do old husbands have old wives and
young husbands have young wives? He took a random sample of 7 couples whose respective
ages are given below:
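Let x be the age of the husband and y the age of the wife. The regression equation of y on x is
$\hat{y} = a + bx$,
$$\text{where } b = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$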
Computation table
   x      y      x²      y²      xy
  39     37    1521    1369    1443
  25     18     625     324     450
  29     20     841     400     580
  35     25    1225     625     875
  32     25    1024     625     800
  27     20     729     400     540
  37     30    1369     900    1110
Σx = 224   Σy = 175   Σx² = 7334   Σy² = 4643   Σxy = 5798
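Here,
$$b = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = \frac{5798 - \frac{224 \times 175}{7}}{7334 - \frac{(224)^2}{7}} = \frac{198}{166} = 1.193$$
And
$$a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{175}{7} - (1.193)\frac{224}{7} = 25 - 38.176 = -13.176$$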
Hence, if the age of the husband is 45, the probable age of the wife would be
$\hat{y} = -13.176 + 1.193(45) \approx 40.51$ years.
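For the regression equation of x on y, $\hat{x} = a + by$,
$$\text{where } b = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum y_i^2 - \frac{(\sum y_i)^2}{n}} = \frac{5798 - \frac{224 \times 175}{7}}{4643 - \frac{(175)^2}{7}} = \frac{198}{268} = 0.739$$
And
$$a = \bar{x} - b\bar{y} = \frac{\sum x}{n} - b\,\frac{\sum y}{n} = \frac{224}{7} - 0.739 \times \frac{175}{7} = 32 - 18.475 = 13.525$$
Hence $\hat{x} = a + by = 13.525 + 0.739y$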
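The hand computation for Application problem-1 can be reproduced in a few lines. This is a sketch rather than the notes' own method: the variable names are assumptions, and the data are the seven (x, y) pairs from the computation table, with x as the husband's age and y as the wife's age. Because the code keeps b unrounded, the intercepts differ from the hand values in the third decimal place.

```python
# Data from the computation table of Application problem-1.
husband = [39, 25, 29, 35, 32, 27, 37]  # x
wife = [37, 18, 20, 25, 25, 20, 30]     # y

n = len(husband)
sum_x, sum_y = sum(husband), sum(wife)
sum_xy = sum(x * y for x, y in zip(husband, wife))
sum_x2 = sum(x * x for x in husband)
sum_y2 = sum(y * y for y in wife)

numerator = sum_xy - sum_x * sum_y / n

# Regression line of y on x: y_hat = a_yx + b_yx * x
b_yx = numerator / (sum_x2 - sum_x ** 2 / n)
a_yx = sum_y / n - b_yx * sum_x / n

# Regression line of x on y: x_hat = a_xy + b_xy * y
b_xy = numerator / (sum_y2 - sum_y ** 2 / n)
a_xy = sum_x / n - b_xy * sum_y / n

wife_at_45 = a_yx + b_yx * 45

print("totals:", sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print("y on x: a =", round(a_yx, 3), "b =", round(b_yx, 3))
print("x on y: a =", round(a_xy, 3), "b =", round(b_xy, 3))
print("predicted wife's age for a 45-year-old husband:", round(wife_at_45, 2))
```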
Application problem-2: A research physician submerged the faces of ten small children in cold
water to control abnormally rapid heartbeats, and recorded the temperature of the water and the
reduction in pulse rate for each child. The results are presented in the following table.
Calculate the correlation coefficient and regression coefficients between temperature of water
and reduction in pulse rate.
Temperature of water (x):     68  65  70  62  60  55  58  65  69  63
Reduction in pulse rate (y):   2   5   1  10   9  13  10   3   4   6
Also show that (i) $\left|\frac{b_{yx} + b_{xy}}{2}\right| \ge |r_{xy}|$
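We know,
$$r_{xy} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sqrt{\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right]\left[\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right]}} = \frac{3835 - \frac{635 \times 63}{10}}{\sqrt{\left[40537 - \frac{(635)^2}{10}\right]\left[541 - \frac{(63)^2}{10}\right]}} = \frac{-165.5}{\sqrt{214.5 \times 144.1}} = -0.94$$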
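We know, the regression coefficient of y on x is,
$$b_{yx} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = \frac{3835 - \frac{635 \times 63}{10}}{40537 - \frac{(635)^2}{10}} = \frac{-165.5}{214.5} = -0.77$$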
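Regression coefficient of x on y is,
$$b_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum y_i^2 - \frac{(\sum y_i)^2}{n}} = \frac{3835 - \frac{635 \times 63}{10}}{541 - \frac{(63)^2}{10}} = \frac{-165.5}{144.1} = -1.15$$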
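(i) $\left|\frac{b_{yx} + b_{xy}}{2}\right| \ge |r_{xy}|$
Here, $\frac{b_{yx} + b_{xy}}{2} = \frac{-0.77 - 1.15}{2} = -0.96$ and $r_{xy} = -0.94$. Both
regression coefficients are negative, so the comparison is made in absolute value:
$\left|\frac{b_{yx} + b_{xy}}{2}\right| = 0.96 \ge 0.94 = |r_{xy}|$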
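The figures for Application problem-2 can be verified with a short script. This is a sketch with assumed variable names, using the ten (temperature, reduction) pairs from the problem table. It recomputes the totals, the correlation coefficient and both regression coefficients, then checks the inequality in absolute value.

```python
import math

# Data from the table of Application problem-2.
temperature = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]  # x
reduction = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]           # y

n = len(temperature)
sum_x, sum_y = sum(temperature), sum(reduction)
sum_xy = sum(x * y for x, y in zip(temperature, reduction))
sum_x2 = sum(x * x for x in temperature)
sum_y2 = sum(y * y for y in reduction)

s_xy = sum_xy - sum_x * sum_y / n   # corrected sum of products
s_xx = sum_x2 - sum_x ** 2 / n      # corrected sum of squares of x
s_yy = sum_y2 - sum_y ** 2 / n      # corrected sum of squares of y

r_xy = s_xy / math.sqrt(s_xx * s_yy)
b_yx = s_xy / s_xx
b_xy = s_xy / s_yy
mean_of_coefficients = (b_yx + b_xy) / 2

print("totals:", sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print("r_xy =", round(r_xy, 2))
print("b_yx =", round(b_yx, 2), "b_xy =", round(b_xy, 2))
print("|mean| >= |r| :", abs(mean_of_coefficients) >= abs(r_xy))
```

The product of the two regression coefficients also equals r² here, which is the geometric-mean theorem applied to the same data.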
Assignment Problem-1: The following data give the test scores and sales made by nine
salesmen during the last year of a big departmental store:
Test scores (y):           14 19 24 21 26 22 15 20 19
Sales in lakh Taka (x):    31 36 48 37 50 45 33 41 39
i. Find the regression equation of test scores on sales. Ans: ŷ = -2.4 + 0.56x
ii. Find the test score when the sale is Tk. 40 lakh. Ans: 20
iii. Find the regression equation of sales on test scores. Ans: x̂ = 7.8 + 1.61y
iv. Predict the value of sale if the test score is 30. Ans: 56.1 lakh
v. Compute the value of correlation coefficient with the help of regression coefficients.
Assignment Problem-2: The following table gives the ages and blood pressure of 10 women:
Age in years (x):     56  42  36  47  49  42  72  63  55  60
Blood pressure (y):  147 125 118 128 125 140 155 160 149 150
Obtain the regression line of y on x. Ans: ŷ = 83.76 + 1.11x
Estimate the blood pressure of a woman whose age is 50 years. Ans: 139.26
Assignment Problem-3: Consider the following data set on two variables x and y:
x: 1 2 3 4 5 6
y: 6 4 3 5 4 2
Assignment Problem-4: Cost accountants often estimate overhead based on production. At the
Standard Knitting Company, they have collected information on overhead expenses and units
produced at different plants and want to estimate a regression equation to predict future
overhead.
Units 56 40 48 30 41 42 55 35
Assignment Problem-5: The following data refer to information about annual sales
Salesmen 1 2 3 4 5 6 7 8
Year of experience 7 4 5 6 11 12 13 17