Correlation and Regression
Dr. Anup Kumar Sharma
Department of Mathematics
November 24, 2022
Dr. Anup Kumar Sharma Correlation and Regression
Correlation and regression
Most of our discussions so far have been confined to a single
variable (univariate data). In statistical work, we often
have to deal with problems involving more than one
variable. Our interest now lies in studying the relationship
between two variables.
Suppose we have data on two variables, say x and y, for
each individual in a group; e.g., x may be the height and y
the weight of adult males in some athletic group. Our data
will then consist of a number of pairs of values of x and y:
{(xi , yi )}, i = 1, 2, ..., n. This type of data is called bivariate
data.
Correlation and regression
How can we explore the association between two quantitative
variables?
An association exists between two variables if a particular value
of one variable is more likely to occur with certain values of the
other variable.
For higher levels of energy use, does the CO2 level in the
atmosphere tend to be higher? If so, then there is an association
between energy use and CO2 level.
Positive Association: As x goes up, y tends to go up.
Negative Association: As x goes up, y tends to go down.
Correlation and regression
How can we explore the relationship between two quantitative
variables?
Graphically, we can construct a scatterplot.
Numerically, we can calculate a correlation coefficient and a
regression equation.
Correlation and regression
Correlation
The degree to which two variables are linearly related
is called the correlation between the two variables.
If it is found that as one variable increases, the other also
increases on the average, there is said to be positive
correlation between the two variables.
If it is found that as one variable increases, the other
variable decreases on the average, we then say that there is
negative correlation between the two variables.
There may still be a third situation where, as one variable
increases, the other remains constant on the average; this is
the case of zero or no correlation.
The above considerations are appropriate when the variables are
linearly related, at least approximately.
Correlation
Use of scatter diagram
The scatter diagram serves as a useful technique in the study
of the relationship and also for measuring the extent of the
linear relationship.
Correlation
Correlation Coefficient
It is a measure of linear association between two variables.
The correlation coefficient of variables x and y, denoted by
rxy (or simply r when there is no scope for confusion), is
defined as
rxy = Cov(x, y) / (√Var(x) √Var(y)),
where Cov(x, y) denotes the covariance of x and y.
If we are given n pairs of values (xi , yi ), i = 1, 2, ..., n of
variables x and y,
Cov(x, y) = (1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ),
where x̄ and ȳ are the means of the values of x and y respectively.
Correlation
Correlation Coefficient
So, we can write
rxy = [(1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)] / [√((1/n) Σ_{i=1}^{n} (xi − x̄)²) √((1/n) Σ_{i=1}^{n} (yi − ȳ)²)]
    = [(1/n) Σ_{i=1}^{n} xi yi − x̄ȳ] / [√((1/n) Σ_{i=1}^{n} xi² − x̄²) √((1/n) Σ_{i=1}^{n} yi² − ȳ²)]
The correlation coefficient rxy of variables x and y is
symmetric, i.e. rxy = ryx .
−1 ≤ rxy ≤ 1.
r > 0 indicates a positive association.
r < 0 indicates a negative association.
Values of r near 0 indicate a very weak linear relationship.
The strength of linear relationship increases as r moves
away from 0.
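As an illustrative sketch (not part of the slides; the function name `correlation` is my own), the definition above can be computed directly from the paired data using the (1/n) population forms:

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r_xy using the (1/n) formulas above."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar  # Cov(x, y)
    var_x = sum(x * x for x in xs) / n - x_bar ** 2               # Var(x)
    var_y = sum(y * y for y in ys) / n - y_bar ** 2               # Var(y)
    return cov / math.sqrt(var_x * var_y)
```

By construction the result is symmetric in x and y, and always lies in [−1, 1].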
Example
The following data are based on information from domestic
affairs. Let x be the average number of employees in a group
health insurance plan, and let y be the average administrative
cost as a percentage of claims. Calculate the correlation
coefficient r.
x y
3 40
7 35
15 30
35 25
75 18
Correlation
x    y    x²     y²     xy
3    40   9      1600   120
7    35   49     1225   245
15   30   225    900    450
35   25   1225   625    875
75   18   5625   324    1350
Σxi = 135   Σyi = 148   Σxi² = 7133   Σyi² = 4674   Σxi yi = 3040
n = 5
x̄ = (Σxi )/n = 135/5 = 27
ȳ = (Σyi )/n = 148/5 = 29.6
√Var(x) = √((1/n) Σxi² − x̄²) = √((1/5) × 7133 − 27²) = 26.41
√Var(y) = √((1/n) Σyi² − ȳ²) = √((1/5) × 4674 − 29.6²) = 7.658
Correlation
Cov(x, y) = (1/n) Σxi yi − x̄ȳ = (1/5) × 3040 − 27 × 29.6 = −191.2
rxy = Cov(x, y) / (√Var(x) √Var(y)) = −191.2 / (26.41 × 7.658) = −0.95
(strong negative correlation between x and y)
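The arithmetic in this worked example can be checked in a few lines (a sketch; the variable names are mine, and the totals come from the table above):

```python
import math

# Totals from the table: Σx, Σy, Σx², Σy², Σxy for n = 5 pairs.
n = 5
sx, sy, sxx, syy, sxy = 135, 148, 7133, 4674, 3040

x_bar, y_bar = sx / n, sy / n            # 27 and 29.6
cov = sxy / n - x_bar * y_bar            # (1/5)*3040 - 27*29.6 = -191.2
sd_x = math.sqrt(sxx / n - x_bar ** 2)   # about 26.41
sd_y = math.sqrt(syy / n - y_bar ** 2)   # about 7.658
r = cov / (sd_x * sd_y)                  # about -0.95
```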
Regression
Once we’ve acquired data with two variables, one very
important question is how the variables are related. For
example, we could ask for the relationship between people’s
weights and heights, or study time and test scores, or two
animal populations. Regression is a set of techniques for
estimating relationships.
We’ll focus on finding one of the simplest types of
relationship: linear.
This process is unsurprisingly called linear regression, and
it has many applications.
For example, we can relate the force for stretching a spring
and the distance that the spring stretches, or explain how
many transistors the semiconductor industry can pack into
a circuit over time.
Despite its simplicity, linear regression is an incredibly
powerful tool for analyzing data.
Regression
Suppose we have collected bivariate data
(xi , yi ), i = 1, 2, ..., n. The goal of linear regression is to
model the relationship between x and y by finding a
function y = f (x) that is a close fit to the data.
Assumption: xi is not random and yi is a function of
xi plus some random noise.
With these assumptions, x is called the independent or
predictor variable and y is called the dependent or response
variable.
The equation of the regression line has the form:
y = a + bx,
where a denotes the y-intercept and b denotes the slope.
Estimation of the regression line: the method of least
squares
If for a given set of paired observations {(xi , yi ), i = 1, 2, ..., n}
the correlation coefficient |rxy | is quite high (i.e. close to 1), then
it indicates that there is a near-linear relationship between x
and y. To estimate that relationship, we fit a line to the
observed data by the principle of least squares.
If Yi is the regression estimate of yi , then under the assumption of
a linear relationship between x and y we write:
yi = Yi + ei ,
where Yi is the predicted value of yi and ei is the residual or
error in the prediction when x = xi .
Here the intercept a and slope b of the regression equation are
estimated by the method of least squares, which consists of
minimizing the error sum of squares (SSE),
S² = Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − Yi )² = Σ_{i=1}^{n} (yi − a − bxi )², w.r.t. a and b.
Using simple calculus, b = byx = Cov(x, y)/Var(x) = rxy (σy /σx ), and
a = ȳ − bx̄.
Hence, the least squares regression line of y on x is:
y − ȳ = byx (x − x̄),
where byx = rxy (σy /σx ) is called the regression coefficient of y on
x.
Similarly, the least squares regression line of x on y is:
x − x̄ = bxy (y − ȳ),
where bxy = rxy (σx /σy ) is called the regression coefficient of x on
y.
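A minimal sketch of the least-squares formulas above (the helper name `fit_line` is my own, not from the slides):

```python
def fit_line(xs, ys):
    """Least-squares line of y on x: returns (a, b) with y = a + b*x,
    using b = Cov(x, y)/Var(x) and a = ybar - b*xbar as derived above."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    var_x = sum(x * x for x in xs) / n - x_bar ** 2
    b = cov / var_x        # regression coefficient b_yx
    a = y_bar - b * x_bar  # intercept
    return a, b
```

Swapping the arguments, `fit_line(ys, xs)`, gives the line of x on y. Note that with the example data, exact arithmetic gives byx ≈ −0.274; the slides round intermediate values and report −0.28.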
Example
Consider again the data from the previous example. Let x be the
average number of employees in a group health insurance plan,
and let y be the average administrative cost as a percentage of
claims.
x y
3 40
7 35
15 30
35 25
75 18
(i) Find the least square line of y on x.
(ii) Find the least square line of x on y.
(iii) Estimate the value of y when x = 95.
Solution
Previously, we have calculated:
x̄ = 27, ȳ = 29.6,
rxy = −0.95,
σx = √Var(x) = 26.41,
σy = √Var(y) = 7.658.
So,
byx = rxy (σy /σx ) = (−0.95) × (7.658/26.41) = −0.28
bxy = rxy (σx /σy ) = (−0.95) × (26.41/7.658) = −3.28
(i) The least squares regression line of y on x is:
y − ȳ = byx (x − x̄)
⟹ y − 29.6 = −0.28 × (x − 27)
⟹ y = −0.28x + 37.16
(ii) The least squares regression line of x on y is:
x − x̄ = bxy (y − ȳ)
⟹ x − 27 = −3.28 × (y − 29.6)
⟹ x = −3.28y + 124.09
(iii) The estimate of y when x = 95:
ŷ(x = 95) = −0.28 × 95 + 37.16
= 10.56
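Using the rounded coefficients from the solution, part (iii) can be checked directly (a sketch with my own variable names):

```python
# Rounded coefficients from the solution above: y = a + b_yx * x.
b_yx, a = -0.28, 37.16
y_hat = a + b_yx * 95   # -26.6 + 37.16 = 10.56
```

As x = 95 lies outside the observed range (3 to 75), this is an extrapolation and should be interpreted with caution.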