
Simple Regression and Simple Correlation

MA261 Statistical and Numerical Techniques


March 24, 2022

1
Relationship between variables

Regression analysis is a conceptually simple method for investigating functional relationships among variables. In any system in which variable quantities change, it is of interest to examine the effects that some variables exert (or appear to exert) on others. Often there exists a functional relationship that is too complicated to grasp or to describe in simple terms.

2
Relationship between variables: History

The term regression was first used as a statistical concept in 1877 by Sir Francis Galton. Galton made a study showing that the height of children born to tall parents tends to move back, or "regress", toward the mean height of the population. He designated the word regression as the name of the general process of predicting one variable (the height of the children) from another (the height of the parents).

3
Relationship between variables: Illustration

Suppose a university admissions director is interested in knowing whether any relationship exists between students' scores on the entrance examination and the students' CGPA upon graduation. The administrator has accumulated a random sample of data from the records of the university.

4
Variables in Regression

We can distinguish two main types of variable at this stage. We shall usually call these predictor variables (independent variables) and response variables (dependent variables). For example, in analyzing the effect of advertising expenditures on sales, a marketing manager’s desire to predict sales would suggest making sales the dependent variable. Advertising expenditure would be the independent variable used to help predict sales. In statistical notation, y denotes the dependent variable and x denotes the independent variable.

5
Variables in Regression

We shall be concerned with relationships of the form:


Response variable = Model function + Random error.

6
Regression Model
Introduction

If we denote the response variable by Y and the set of predictor variables by X1, X2, . . . , Xp, where p denotes the number of predictor variables, then the true relationship between Y and X1, X2, . . . , Xp can be approximated by the regression model.

Regression Model

Y = f(X1, X2, . . . , Xp) + 𝜀    (1)

where 𝜀 is assumed to be a random error representing the discrepancy in the approximation. It accounts for the failure of the model to fit the data exactly.

7
Linear Regression Model

An example is the linear regression equation


Linear Regression Equation

Y = 𝛽0 + 𝛽1 X1 + 𝛽2 X2 + . . . + 𝛽p Xp + 𝜀 (2)

where 𝛽0, 𝛽1, . . . , 𝛽p, called the regression parameters or coefficients, are unknown constants to be determined (estimated) from the data.

8
Regression Model: Classification

• A regression equation containing only one predictor variable is called a simple regression equation.
• An equation containing more than one predictor variable is called a multiple regression equation.
• When we deal with only one response variable, the regression analysis is called univariate regression.
• When we have two or more response variables, the regression is called multivariate regression.

9
Regression Model: Illustration

The failure rate of a certain electronic device is suspected to increase linearly with its temperature. The simple model is therefore Y = 𝛽0 + 𝛽1 X1 + 𝜀, where Y is the observed failure rate of the device and X1 is its temperature; the error term 𝜀 represents variation in the failure rate due to other unknown factors.

10
Simple Linear Regression

We start with the simple case of studying the relationship between a response variable Y and a predictor variable X1. Since we have only one predictor variable, we shall drop the subscript in X1 and use X for simplicity.

11
Data in Simple Regression

Suppose we have observations on n subjects consisting of a dependent or response variable Y and an explanatory variable X. The observations are usually recorded as in Table 1.

12
Data in Simple Regression

Table 1: Data Used in Simple Regression

Observation Number   Response Y   Predictor X
1                    y1           x1
2                    y2           x2
⋮                    ⋮            ⋮
n                    yn           xn

13
Data in Simple Regression

The following data pertain to the number of computer jobs per day and the central processing unit (CPU) time required.

Table 2: Data for Regression

Number of jobs 1 2 3 4 5
CPU times 2 5 4 9 10

14
Data in Simple Regression: Scatter Diagram

Figure 1: Scatter diagram of the Table 2 data (x-axis: Number of Jobs; y-axis: CPU time)

15
Data in Simple Regression: Scatter Diagram

Figure 2: Straight line with distances |ei| from the plotted points (x-axis: Number of Jobs; y-axis: CPU time)

16
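Figures 1 and 2 can be reproduced from the data in Table 2. Below is a minimal matplotlib sketch (assuming matplotlib is available); the straight line drawn is the least squares line Ŷ = 2x obtained later in these notes.

```python
import matplotlib.pyplot as plt

# Data from Table 2: number of jobs per day (x) and CPU time required (y)
jobs = [1, 2, 3, 4, 5]
cpu_time = [2, 5, 4, 9, 10]

plt.scatter(jobs, cpu_time, label="observed data")                 # Figure 1: scatter diagram
plt.plot(jobs, [2 * x for x in jobs], "r--", label="line y = 2x")  # Figure 2: a straight line through the points
plt.xlabel("Number of Jobs")
plt.ylabel("CPU time")
plt.legend()
plt.show()
```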
Method of Least Squares

The least squares line through a set of points is defined as the straight line which results in the smallest value of $\sum_{i=1}^{n} e_i^2$.

Table 3: Calculation for Least Squares Line

x      y    (x − x̄)   (y − ȳ)   (x − x̄)(y − ȳ)   (x − x̄)²
1      2      −2        −4             8              4
2      5      −1        −1             1              1
3      4       0        −2             0              0
4      9       1         3             3              1
5     10       2         4             8              4
Total 15      30         0         0            20             10
17
Method of Least Squares (cont.)

$$b_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{20}{10} = 2$$

18
Method of Least Squares (cont.)

$$b_0 = \frac{\sum_{i=1}^{n} y_i}{n} - b_1\,\frac{\sum_{i=1}^{n} x_i}{n} = \frac{30}{5} - 2\cdot\frac{15}{5} = 0$$

Therefore the least squares line is Ŷ = 2x.
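The calculation above is easy to verify programmatically. The following is a minimal Python sketch (plain Python, no external libraries) that reproduces b1 = 2 and b0 = 0 for the data of Table 2.

```python
# Data from Table 2: number of jobs (x) and CPU time (y)
x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 9, 10]
n = len(x)

x_bar = sum(x) / n   # 3.0
y_bar = sum(y) / n   # 6.0

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar   # intercept

print(b1, b0)   # 2.0 0.0
```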

19
Computational Formula

Consider the notations for the sums of squares and sums of cross-products.

$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$

20
Computational Formula

$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$

21
Estimates

$$b_1 = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{S_{xy}}{S_{xx}}$$

22
Estimates (cont.)

$$b_0 = \frac{\sum_{i=1}^{n} y_i}{n} - b_1\,\frac{\sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1\bar{x}$$
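As a quick check of these computational (shortcut) formulas, here is a small Python sketch applied to the Table 2 data; the function name is illustrative, and the result matches the deviation-form calculation (b1 = 2, b0 = 0).

```python
def least_squares(x, y):
    """Slope and intercept via the shortcut formulas b1 = Sxy / Sxx and b0 = y_bar - b1 * x_bar."""
    n = len(x)
    sxx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b1 = sxy / sxx
    b0 = sum(y) / n - b1 * sum(x) / n
    return b1, b0

print(least_squares([1, 2, 3, 4, 5], [2, 5, 4, 9, 10]))  # (2.0, 0.0)
```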

23
The Product-moment correlation coefficient
Introduction

We wish to measure both the direction and the strength of the relationship between Y and X. Two related measures, known as the covariance and the correlation coefficient, are developed below.

24
Introduction (cont.)

The scatter diagram of two variables indicates the direction of the relationship between the variables. A positive correlation (Figure 3) means that the scores tend to vary directly; as one score increases, the other score increases, and, as one score decreases, the other score decreases. A positive correlation exists, for example, between heart rate and oxygen consumption.

25
Introduction (cont.)

A negative correlation (Figure 5) indicates an inverse relationship; as one score increases, the other decreases. The measure of the strength of the relationship is given by the correlation coefficient. The product-moment correlation coefficient, in its current formulation, is due to Karl Pearson.

26
Positive Correlation

Figure 3: Positive correlation

27
No Correlation

Figure 4: Uncorrelated / no correlation

28
Negative Correlation

Figure 5: Negative correlation

29
Correlation Coefficient

Covariance

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

If Cov(X, Y) > 0, then there is a positive relationship between Y and X, but if Cov(X, Y) < 0, then the relationship is negative. Unfortunately, Cov(X, Y) does not tell us much about the strength of such a relationship because it is affected by changes in the units of measurement.

30
Correlation Coefficient (cont.)

$$r_{xy} = \frac{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}\,\sqrt{\dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}}}$$

31
Correlation Coefficient (cont.)

After removing (n − 1) from the numerator and denominator, we get

$$r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}}$$

The correlation coefficient rxy can range from +1, for perfect positive association, to −1, for perfect negative association.
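A minimal Python sketch (function name illustrative) that computes rxy = Sxy / (√Sxx √Syy) for the CPU data of Table 2:

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation r = Sxy / (sqrt(Sxx) * sqrt(Syy))."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sqrt(sxx) * sqrt(syy))

print(pearson_r([1, 2, 3, 4, 5], [2, 5, 4, 9, 10]))  # about 0.93: a strong positive correlation
```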

32
Exercise

Example
The time it takes to transmit a file depends on the file size. Suppose you transmitted 30 files, with an average size of 126 Kbytes and a standard deviation of 35 Kbytes. The average transmission time was 0.04 seconds with a standard deviation of 0.01 seconds. The correlation coefficient between the time and the size was 0.86. Based on these data, fit a linear regression model and predict the time it will take to transmit a 400 Kbyte file.
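This exercise gives only summary statistics rather than raw data. One way to proceed is sketched below, using the standard identities b1 = r · sy/sx and b0 = ȳ − b1·x̄ (these identities follow from b1 = Sxy/Sxx and the definition of rxy; the numbers are taken from the exercise statement).

```python
# Summary statistics from the exercise: x = file size (Kbytes), y = transmission time (seconds)
x_bar, s_x = 126, 35
y_bar, s_y = 0.04, 0.01
r = 0.86

b1 = r * s_y / s_x        # slope = r * (std dev of y) / (std dev of x)
b0 = y_bar - b1 * x_bar   # intercept = y_bar - b1 * x_bar

print(b1, b0)             # coefficients of the fitted line y_hat = b0 + b1 * x
print(b0 + b1 * 400)      # predicted time for a 400 Kbyte file, roughly 0.107 seconds
```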

33
Exercise (cont.)

Example
A mathematics placement test is given to all entering freshmen
at a small college. A student who receives a grade below 35 is
denied admission to the regular mathematics course and placed
in a remedial class. The placement test scores and the final
grades for 20 students who took the regular course were
recorded.

34
Exercise (cont.)

Table 4

Placement Test 50 35 35 40 55 65 35 60 90
Course Grade 53 41 61 56 68 36 11 70 79
Placement Test 35 90 80 60 60 40 55 50 65 50
Course Grade 59 54 91 48 71 47 53 68 67 79

35
Exercise (cont.)

(a) Plot a scatter diagram.

(b) Find the equation of the regression line to predict course grades from placement test scores.

(c) Graph the line on the scatter diagram.

(d) If 60 is the minimum passing grade, below which placement test score should students in the future be denied admission to this course?

(e) Determine the correlation coefficient and write your comment.

36
Rank Correlation

When a number of individuals are arranged in order according to some quality which they all possess to a varying degree, they are said to be ranked. The arrangement as a whole is called a ranking, in which each member has a rank.

37
Rank Correlation

It is customary, but not essential, to denote the ranks by the ordinal numbers 1, 2, . . . , n, where n is the number of objects. Thus the object or individual which comes fifth in the ranking has the rank 5. To say that the rank of a member according to some quality is 5 is equivalent to saying that, in an arrangement according to that quality, four members are given priority over that particular member, i.e., are preferred to it.

38
Rank Correlation

In practice, ranked material can arise in many different ways:

(a) Purely as arrangements of objects which are being considered only by reference to their position in space or time.
(b) According to some quality which we cannot measure on any objective scale.
(c) According to some measurable or countable quality.
(d) According to some quality which we believe to be measurable but cannot measure for practical or theoretical reasons.

39
Rank Correlation

Example
Suppose ten programmers are ranked according to their ability
in Algorithm development and Software development.

Table 5: Rank

Programmer: A B C D E F G H I J
Algorithm: 7 4 3 10 6 2 9 8 1 5
Software development: 5 7 3 10 1 9 6 2 8 4

40
Rank Correlation (cont.)

We are interested in whether there is any relationship between ability in Algorithm development and Software development. We now discuss the coefficient of rank correlation, denoted by the Greek letter 𝜌 (rho) and named after C. Spearman, who introduced it into psychological work.

41
Computing Rank Correlation

Table 6: Calculation using Spearman’s Method

Programmer: A B C D E F G H I J
Algorithm: 7 4 3 10 6 2 9 8 1 5
Software development: 5 7 3 10 1 9 6 2 8 4
Differences d: 2 -3 0 0 5 -7 3 6 -7 1
d²: 4 9 0 0 25 49 9 36 49 1

42
Computing Rank Correlation (cont.)

We define Spearman’s 𝜌 by the equation

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

In the present example,

43
Computing Rank Correlation (cont.)

$$\rho = 1 - \frac{6 \times 182}{10 \times (100 - 1)} = 1 - \frac{6 \times 182}{990} = \frac{990 - 1092}{990} = -\frac{102}{990} = -0.10303$$
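A minimal Python check of this computation, using the rankings from Table 6 (the function name is illustrative):

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

algorithm = [7, 4, 3, 10, 6, 2, 9, 8, 1, 5]   # ranks in Algorithm development (Table 6)
software  = [5, 7, 3, 10, 1, 9, 6, 2, 8, 4]   # ranks in Software development (Table 6)
print(spearman_rho(algorithm, software))      # about -0.103
```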

44
Computing Rank Correlation (cont.)

We note that when two rankings are identical, 𝜌 = 1, and when one ranking is the reverse of the other, 𝜌 = −1. The coefficient 𝜌 always satisfies −1 ≤ 𝜌 ≤ +1.

45
