
Simple Linear Regression and Correlation

5.1 The Simple Linear Regression Model


5.2 Estimation: Model Building
5.3 The Simple Linear Correlation
5.3.1 Coefficient of Correlation
5.3.2 Rank Correlation Coefficient

Learning Objectives
After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random variables
5.1 Simple Linear Regression
• Regression refers to the statistical technique of modeling the relationship between variables.
• In simple linear regression, we model the relationship between two variables using a linear equation.
• One of the variables, denoted by Y, is called the dependent variable and the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.
Scatter Diagram
This scatter plot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis.

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y); Advertising on the horizontal axis (0 to 50), Sales on the vertical axis (0 to 140)]

We notice that:
✓ Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.

✓ The scatter of points tends to be distributed around a positively sloped straight line.
✓ The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
✓ The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
✓ The line represents the nature of the relationship on average.
Examples of Other Scatter plots

[Figure: additional example scatter plots of Y against X showing various patterns of association]
5.2 Estimation: Model Building
The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component:

Data = Systematic component + Random errors (the statistical model)

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line. In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR), and the random component is the unexplained variation or errors (SSE).
The Simple Linear Regression Model
The population simple linear regression model:

Y = β₀ + β₁X + ε

where β₀ + β₁X is the nonrandom (systematic) component and ε is the random component, and:
✓ Y is the dependent variable, the variable we wish to explain or predict
✓ X is the independent variable, also called the predictor variable
✓ ε is the error term, the only random component in the model, and thus the only source of randomness in Y
✓ β₀ is the intercept of the systematic component of the regression relationship
✓ β₁ is the slope of the systematic component

The conditional mean of Y: E[Y | X] = β₀ + β₁X

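To make the roles of the systematic and random components concrete, here is a minimal simulation sketch of the model (not from the slides; the parameter values β₀ = 5, β₁ = 2, and σ = 1 are illustrative assumptions):

```python
import numpy as np

# A sketch of the population model Y = beta0 + beta1*X + eps.
# beta0, beta1 and sigma are illustrative assumed values, not from the slides.
rng = np.random.default_rng(0)

beta0, beta1, sigma = 5.0, 2.0, 1.0
x = np.linspace(0, 10, 50)                 # X values are treated as fixed, not random
eps = rng.normal(0.0, sigma, size=x.size)  # errors eps ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps                # observed Y = systematic part + random error

print(y[:3])                   # a few simulated observations
print(beta0 + beta1 * x[:3])   # the corresponding conditional means E[Y | X]
```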
Assumptions of the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εᵢ.
• The errors εᵢ are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations. That is: εᵢ ~ N(0, σ²).
Picturing the Simple Linear Regression Model

The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:

E[Yᵢ] = β₀ + β₁Xᵢ

Actual observed values of Y differ from the expected value by an unexplained or random error:

Yᵢ = E[Yᵢ] + εᵢ = β₀ + β₁Xᵢ + εᵢ

[Figure: regression plot of the line E[Y] = β₀ + β₁X, showing the intercept β₀, the slope β₁, and the error εᵢ as the vertical distance from an observed point (Xᵢ, Yᵢ) to the line]
Estimation: The Method of Least Squares
• Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

Y = b₀ + b₁X + e

where b₀ estimates the intercept of the population regression line, β₀; b₁ estimates the slope of the population regression line, β₁; and e stands for the observed errors, the residuals from fitting the estimated regression line b₀ + b₁X to a set of n points.

The estimated regression line:

Ŷ = b₀ + b₁X

where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
Errors in Regression

[Figure: the fitted regression line Ŷ = b₀ + b₁X together with an observed data point (Xᵢ, Yᵢ); the error is the vertical distance between the observed point and the predicted value Ŷᵢ on the line]

For each observation, the error is eᵢ = Yᵢ − Ŷᵢ, where Ŷᵢ is the predicted value of Y for Xᵢ.
Least Squares Regression (Cont)
The sum of squared errors in regression is:

SSE = Σ eᵢ² = Σ (yᵢ − ŷᵢ)²   (sums over i = 1, …, n)

The least squares regression line is the one that minimizes the SSE with respect to the estimates b₀ and b₁. Taking the partial derivatives with respect to b₀ and b₁ and setting them equal to zero, we have the normal equations:

Σ yᵢ = n·b₀ + b₁·Σ xᵢ
Σ xᵢyᵢ = b₀·Σ xᵢ + b₁·Σ xᵢ²

Solving for the parameters b₀ and b₁, we have:

b₁ = [n·Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / [n·Σxᵢ² − (Σxᵢ)²]   and   b₀ = ȳ − b₁x̄
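A minimal Python sketch of these closed-form estimators (the function name least_squares_fit is an illustrative choice):

```python
def least_squares_fit(x, y):
    """Return (b0, b1) from the closed-form least-squares formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)

    # b1 = [n*Sum(xy) - Sum(x)*Sum(y)] / [n*Sum(x^2) - (Sum(x))^2]
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # b0 = y-bar - b1 * x-bar
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1
```

The same estimates can equally be obtained from the sums of squares and cross products, b₁ = SSxy/SSx, as shown on the following slides.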
Least Squares Regression (Cont)
• Therefore, the regression line would be:

Ŷᵢ = b₀ + b₁Xᵢ

• On the other hand, we can derive the regression line from the sums of squares and cross products:

SSx = Σ(xᵢ − x̄)² = Σxᵢ² − (Σxᵢ)²/n
SSy = Σ(yᵢ − ȳ)² = Σyᵢ² − (Σyᵢ)²/n
SSxy = Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n
Least Squares Regression (Cont)
• The least-squares regression estimators are:

b₁ = SSxy / SSx   and   b₀ = ȳ − b₁x̄

• Thus, again, the regression line would be:

Ŷᵢ = b₀ + b₁Xᵢ

✓ This equation is called the line of regression of Y on X.
✓ b₁ is sometimes called the regression coefficient.
✓ The line of regression of Y on X passes through the point (x̄, ȳ).
✓ If b₁ is positive, the relationship between X and Y is called a positive (or direct) linear relationship.
✓ If b₁ is negative, the relationship between X and Y is called a negative (or indirect) linear relationship.
Example on Regression line
A consumer welfare agency wants to investigate the relationship between the size of houses and the rents paid by tenants in a city. The agency collected the following information on the sizes (in hundreds of square meters) of houses and their monthly rents in dollars.
Size of houses (in 100m2 ) 0.16 0.19 0.21 0.23 0.27 0.32
Monthly rent (in Dollars) 170 180 210 250 300 360

Example on Regression line (Cont)
a) Find the regression line with the size of house as the independent variable and monthly rent as the dependent variable.
b) Predict the monthly rent for a house of 25 square meters.
Solution
xᵢ      yᵢ      xᵢyᵢ     xᵢ²       yᵢ²
0.16    170     27.2    0.0256    28900
0.19    180     34.2    0.0361    32400
0.21    210     44.1    0.0441    44100
0.23    250     57.5    0.0529    62500
0.27    300     81.0    0.0729    90000
0.32    360    115.2    0.1024   129600
1.38   1470    359.2    0.3340   387500

x̄ = (1/n)Σxᵢ = (1/6)(1.38) = 0.23   and   ȳ = (1/n)Σyᵢ = (1/6)(1470) = 245

b₁ = [nΣxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / [nΣxᵢ² − (Σxᵢ)²] = [6(359.2) − (1.38)(1470)] / [6(0.334) − 1.38²] = 1271.08

b₀ = ȳ − b₁x̄ = 245 − 1271.08(0.23) = −47.34

Therefore, the regression line will be Ŷᵢ = −47.34 + 1271.08Xᵢ
Solution (cont)
On the other hand, from the table:

SSx = Σ(xᵢ − x̄)² = Σxᵢ² − (Σxᵢ)²/n = 0.3340 − 1.38²/6 = 0.0166
SSy = Σ(yᵢ − ȳ)² = Σyᵢ² − (Σyᵢ)²/n = 387500 − 1470²/6 = 27350
SSxy = Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = 359.2 − (1.38)(1470)/6 = 21.1

b₁ = SSxy / SSx = 21.1 / 0.0166 = 1271.08
b₀ = ȳ − b₁x̄ = 245 − 1271.08(0.23) = −47.34

Hence the regression line will again be:

Ŷᵢ = b₀ + b₁Xᵢ = −47.34 + 1271.08Xᵢ
Solution (Cont)

b) For a 25 square meter house, Xᵢ = 0.25, hence the estimated monthly rent will be:

Ŷᵢ = −47.34 + 1271.08Xᵢ = −47.34 + 1271.08(0.25) = 270.42 dollars
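The computations above can be reproduced with a short script; this sketch assumes the hypothetical least_squares_fit helper from the earlier sketch is available:

```python
size = [0.16, 0.19, 0.21, 0.23, 0.27, 0.32]   # size in hundreds of square meters
rent = [170, 180, 210, 250, 300, 360]         # monthly rent in dollars

b0, b1 = least_squares_fit(size, rent)        # helper sketched after the normal equations
print(round(b0, 2), round(b1, 2))             # approximately -47.35 and 1271.08
print(round(b0 + b1 * 0.25, 2))               # predicted rent for a 25 m^2 house, about 270.42
```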
5.3 Simple Linear Correlation
• The correlation between two variables, X and Y, is a measure of the degree of linear association between the two variables.
• Types of correlation:
✓ Positive (direct) correlation: the two variables deviate in the same direction.
✓ Negative (inverse) correlation: the two variables deviate in opposite directions.
Simple Linear Correlation
• Commonly used methods for studying the correlation between X and Y are:
✓ Scatter diagram method
✓ Covariance method (Karl Pearson's coefficient)
✓ Rank correlation method
Scatter diagram and correlation

[Figure: six scatter plots of Y against X illustrating r = −1, r = 0, and r = 1 (top row) and r = −0.8, r = 0, and r = 0.8 (bottom row)]
5.3.1 Coefficient of Correlation Method

• The covariance of two variables X and Y is

Cov(X, Y) = (1/n) Σ (xᵢ − x̄)(yᵢ − ȳ)

but it depends on the units of measurement used for X and Y.
• The larger the absolute value of Cov(X, Y), the stronger the relationship between X and Y. In particular:

Cov(X, Y) > 0 → direct relationship between X and Y
Cov(X, Y) = 0 → no linear relationship between X and Y
Cov(X, Y) < 0 → inverse relationship between X and Y
Coefficient of Correlation Method
• Karl Pearson's coefficient of correlation (r) is a relative measure of covariance, obtained by dividing the covariance of X and Y by the product of the standard deviations of the two variables:

r = Cov(X, Y) / √(Sxx · Syy) = Sxy / √(Sxx · Syy)

where Sxx = (1/n)Σ(xᵢ − x̄)², Syy = (1/n)Σ(yᵢ − ȳ)², and Sxy = Cov(X, Y).
Coefficient of Correlation Method
• By simple substitution and simplification we have:

r = [n Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / √{[n Σxᵢ² − (Σxᵢ)²]·[n Σyᵢ² − (Σyᵢ)²]}

• The coefficient of correlation gives a measure of the strength of the relationship between the two variables, provided that the relationship is linear.
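A minimal sketch of this computational formula in Python (the function name pearson_r is an illustrative choice):

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator
```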
Coefficient of Correlation Method
• If r = 1, there is a perfect positive linear correlation.
  If 0 < r < 1, there is a positive linear correlation.
  If r = 0, there is no linear relationship.
  If −1 < r < 0, there is a negative linear correlation.
  If r = −1, there is a perfect negative linear correlation.
• The absolute value of r indicates the strength or exactness of the relationship.
Example on the Coefficient of Correlation
An experiment is conducted in a supermarket to observe the relationship between the amount of display space allotted to a brand of coffee (brand A) and its weekly sales. The space allotted to brand A was varied over 3, 6, and 9 square-foot displays in a random manner over 12 weeks, while the space allotted to competing brands was maintained at a constant 3 feet for each. The following data were observed.
Space allotted (X) 6 3 6 9 3 9 6 3 9 6 3 9
Weekly sales (Y) 526 421 581 630 412 560 434 443 590 570 346 672

Example (cont)
a) Find the regression line with space allotted as the independent variable and weekly sales as the dependent variable.
b) Find the coefficient of correlation and interpret the result.
Solution
We have the following table from the data given:

xᵢ     yᵢ      xᵢyᵢ     xᵢ²      yᵢ²
 6    526     3156     36     276676
 3    421     1263      9     177241
 6    581     3486     36     337561
 9    630     5670     81     396900
 3    412     1236      9     169744
 9    560     5040     81     313600
 6    434     2604     36     188356
 3    443     1329      9     196249
 9    590     5310     81     348100
 6    570     3420     36     324900
 3    346     1038      9     119716
 9    672     6048     81     451584
72   6185    39600    504    3300627

a) x̄ = 72/12 = 6 and ȳ = 6185/12 = 515.42

b₁ = [nΣxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / [nΣxᵢ² − (Σxᵢ)²] = [12(39600) − 72(6185)] / [12(504) − 72²] = 34.58

b₀ = ȳ − b₁x̄ = 515.42 − 34.58(6) = 307.94
Solution (cont)
The equation of the regression line will be:

Ŷᵢ = b₀ + b₁Xᵢ = 307.94 + 34.58Xᵢ

b) The coefficient of correlation will be:

r = [nΣxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / √{[nΣxᵢ² − (Σxᵢ)²]·[nΣyᵢ² − (Σyᵢ)²]}
  = [12(39600) − 72(6185)] / √{[12(504) − 72²]·[12(3300627) − 6185²]} = 0.874

There is a strong positive correlation between the space allotted and weekly sales.
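As a quick check, a sketch using NumPy's built-in routines (np.polyfit and np.corrcoef) reproduces these estimates; small differences from the slide values are due to rounding:

```python
import numpy as np

space = np.array([6, 3, 6, 9, 3, 9, 6, 3, 9, 6, 3, 9], dtype=float)
sales = np.array([526, 421, 581, 630, 412, 560, 434, 443, 590, 570, 346, 672], dtype=float)

b1, b0 = np.polyfit(space, sales, deg=1)   # slope and intercept of the fitted line
r = np.corrcoef(space, sales)[0, 1]        # Pearson coefficient of correlation

print(round(b0, 2), round(b1, 2), round(r, 3))   # about 307.92, 34.58, 0.874
```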
Coefficient of Determination
• The square of r is called the coefficient of determination, which gives the percentage of variation in the dependent variable that is accounted for by the independent variable.
• In the above example:

r² × 100% = 0.874² × 100% = 76.39%

• 76.39% of the variation in weekly sales is accounted for by the space allotted for sales in the supermarket, and the remaining 23.61% is due to other factors.
Example
• Assume you are in charge of the money supply for a given country. You are given the following historical data on the money supply and gross national product (GNP), both in millions of dollars.

Money supply 2.0 2.5 3.2 3.6 3.3 4.0 4.2 4.6 4.8 5.0
GNP          5.0 5.5 6.0 7.0 7.2 7.7 8.4 9.0 9.7 10.0

a) Develop the estimating equation to predict GNP from the money supply.
b) How do you interpret the slope of the regression line?
c) Predict the GNP when the money supply is $4,300,000.
Solution
xᵢ     yᵢ      xᵢyᵢ     xᵢ²      yᵢ²
2.0    5.0    10.00    4.00     25.00
2.5    5.5    13.75    6.25     30.25
3.2    6.0    19.20   10.24     36.00
3.6    7.0    25.20   12.96     49.00
3.3    7.2    23.76   10.89     51.84
4.0    7.7    30.80   16.00     59.29
4.2    8.4    35.28   17.64     70.56
4.6    9.0    41.40   21.16     81.00
4.8    9.7    46.56   23.04     94.09
5.0   10.0    50.00   25.00    100.00
37.2   75.5   295.95  147.18   597.03
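The slide stops at the summary table; a minimal sketch to finish parts (a) and (c), reusing the hypothetical least_squares_fit helper sketched earlier (the numeric comments are approximate):

```python
money = [2.0, 2.5, 3.2, 3.6, 3.3, 4.0, 4.2, 4.6, 4.8, 5.0]   # money supply, millions of dollars
gnp   = [5.0, 5.5, 6.0, 7.0, 7.2, 7.7, 8.4, 9.0, 9.7, 10.0]  # GNP, millions of dollars

b0, b1 = least_squares_fit(money, gnp)   # part (a): estimating equation GNP-hat = b0 + b1 * money
print(round(b0, 3), round(b1, 3))        # roughly 1.168 and 1.716

print(round(b0 + b1 * 4.3, 2))           # part (c): predicted GNP for a money supply of 4.3, roughly 8.55
```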
5.3.2 Rank Correlation Coefficient
• The rank correlation coefficient is used to perform correlation analysis on qualitative (ranked) data.
• The rank correlation coefficient is a measure of the correlation that exists between two sets of ranks: a measure of the degree of association between two variables that we would not otherwise have been able to calculate.
Rank Correlation Coefficient (Cont)
• Suppose we have n individuals whose ranks according to characteristic A are x₁, x₂, …, xₙ and according to characteristic B are y₁, y₂, …, yₙ. Then the coefficient of correlation computed using Spearman's rank correlation coefficient is given by:

rₛ = 1 − [6 Σ dᵢ²] / [n(n² − 1)]

where dᵢ = xᵢ − yᵢ is the difference between the ranks xᵢ and yᵢ.
Rank Correlation Coefficient (cont)
• Steps in the computation of the rank correlation coefficient (see the code sketch after this list):
1. For the first variable X, give rank 1 to the highest (or smallest) value, rank 2 to the second highest (or smallest) value, and so on up to rank n.
2. On the same lines, similarly rank the variable Y from 1 to n.
3. Find dᵢ = rank xᵢ − rank yᵢ.
4. Compute dᵢ² and evaluate rₛ.
5. We still have −1 ≤ rₛ ≤ 1, and the interpretations we have for r also hold here.
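These steps translate directly into a short sketch (spearman_rs is an illustrative name; it assumes the ranks have already been assigned and that there are no ties):

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation coefficient from two lists of ranks (no ties)."""
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))   # sum of squared rank differences
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```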
Example on rank correlation
A small panel of food and restaurant critics in a large city participated in a study in which they rated a sample of dinner wines. The wines were served without the tasters being aware of their identities or prices. The following table shows the panel's collective ratings of the wines (1 being the best tasting), along with the wholesale prices. Find the rank correlation coefficient.
Wine A B C D E F G H I J K
Taste rank 7 4 11 9 3 1 6 2 10 8 5
Price in $ 8.95 9.89 6.95 8.49 10.75 10.60 7.95 9.60 7.45 6.65 9.79
Solution
We rank the given data as follows.

Wine   Taste rank (xᵢ)   Price ($)   Price rank (yᵢ)   dᵢ = xᵢ − yᵢ   dᵢ²
 A            7             8.95            6                1          1
 B            4             9.89            3                1          1
 C           11             6.95           10                1          1
 D            9             8.49            7                2          4
 E            3            10.75            1                2          4
 F            1            10.60            2               −1          1
 G            6             7.95            8               −2          4
 H            2             9.60            5               −3          9
 I           10             7.45            9                1          1
 J            8             6.65           11               −3          9
 K            5             9.79            4                1          1
Total                                                                  36

From the rank correlation coefficient formula:

rₛ = 1 − [6Σdᵢ²] / [n(n² − 1)] = 1 − 6(36) / [11(11² − 1)] = 1 − 216/1320 = 0.836

There is a strong positive correlation between the taste rank and the wine price.
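Using the spearman_rs sketch from earlier with the two rank columns of this table reproduces the result:

```python
taste_rank = [7, 4, 11, 9, 3, 1, 6, 2, 10, 8, 5]
price_rank = [6, 3, 10, 7, 1, 2, 8, 5, 9, 11, 4]   # rank 1 = most expensive, as in the table

print(round(spearman_rs(taste_rank, price_rank), 3))   # about 0.836
```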
Coefficient of Determination


r² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)² = 1 − SSE/TSS = (TSS − ESS)/TSS = RSS/TSS = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)²

where TSS = Σ(yᵢ − ȳ)² is the total sum of squares, ESS = SSE = Σ(yᵢ − ŷᵢ)² is the error sum of squares, and RSS = Σ(ŷᵢ − ȳ)² is the regression (explained) sum of squares.
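A sketch of this decomposition, assuming NumPy is available; for a simple linear fit the value it returns agrees with the square of Pearson's r:

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination computed as 1 - SSE/TSS for a simple linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1, b0 = np.polyfit(x, y, deg=1)
    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat) ** 2)     # unexplained (error) variation
    tss = np.sum((y - y.mean()) ** 2)  # total variation of y about its mean
    return 1 - sse / tss
```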
