CHAPTER 3
Correlation Analysis
3.1 Correlation
Correlation analysis is a statistical technique used to quantify the direction and strength
of association between two variables. An estimate of the correlation between two
variables is called the correlation coefficient, denoted by r. In correlation analysis we
estimate a sample correlation coefficient, which ranges between −1 and +1. When the
correlation coefficient is positive, higher levels of one variable are associated with higher
levels of the other; when it is negative, higher levels of one variable are associated with
lower levels of the other. The sign of the correlation coefficient therefore indicates the
direction of the association, while its magnitude indicates the strength of the association.
3.2 Estimation methods
There are two main approaches commonly used in computing the sample correlation
coefficient. These are:
• The Pearson product-moment correlation coefficient, for estimating linear
correlation among continuous variables; and
• Spearman's rank correlation coefficient, for estimating monotonic (not necessarily
linear) association.
(a) Pearson product-moment
This is used for quantitative data measured on an interval or ratio scale. For n paired
observations (xᵢ, yᵢ), the Pearson product-moment correlation coefficient is defined by:

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\sum_{i=1}^{n}(y_i-\bar{y})^{2}}} $$
(b) Spearman's rank correlation coefficient
This is applied to the ranks of the observations instead of the actual data. It is defined by:

$$ r = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2}-1)} $$

where
dᵢ = the difference between the pair of ranks for the i-th observation, and
n = the total number of observations (pairs of ranks).
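As a computational check on these two definitions, the Python sketch below implements both formulas directly. The function and variable names are my own illustrative choices, not a library API; ties are ranked by averaging positions, the convention used in Example 3.2 below.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation, computed from the definition."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """Rank the values of v, averaging the ranks of tied values."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of the tied 1-based positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_r(x, y):
    """Spearman coefficient via r = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

For instance, pearson_r applied to the scores in Example 3.1 below returns about 0.2412, and spearman_r applied to the data of Example 3.2 returns about 0.7262.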
Example 3.1
The table below gives the math and English scores of 10 students in a test.

Math score:    5  0  3  1  2  2  5  3  5  4
English score: 1  2  1  3  3  1  3  1  6  2

a) Find the correlation coefficient between the math score and the English score and
interpret it.
b) Test the hypothesis that there is no linear association between the variables at the 5%
level of significance.
Solution
a) With x = math score and y = English score, we have x̄ = 3 and ȳ = 2.3, and

$$ \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = 6, \qquad \sum_{i=1}^{n}(x_i-\bar{x})^{2} = 28, \qquad \sum_{i=1}^{n}(y_i-\bar{y})^{2} = 22.1 $$

Math score (x) | English score (y) | (x−x̄)(y−ȳ) | (x−x̄)² | (y−ȳ)²
5              | 1                 | −2.6        | 4       | 1.69
0              | 2                 | 0.9         | 9       | 0.09
3              | 1                 | 0           | 0       | 1.69
1              | 3                 | −1.4        | 4       | 0.49
2              | 3                 | −0.7        | 1       | 0.49
2              | 1                 | 1.3         | 1       | 1.69
5              | 3                 | 1.4         | 4       | 0.49
3              | 1                 | 0           | 0       | 1.69
5              | 6                 | 7.4         | 4       | 13.69
4              | 2                 | −0.3        | 1       | 0.09
x̄ = 3         | ȳ = 2.3          | 6           | 28      | 22.1
$$ r = \frac{6}{\sqrt{28 \times 22.1}} = 0.2412 $$

The correlation coefficient between the math score and the English score is only about
0.24, which indicates a weak positive linear relationship: higher math scores are only
weakly associated with higher English scores.
b)
1. H₀: ρ = 0 versus H₁: ρ ≠ 0.
2. The population correlation is unknown and the sample is small, so the test statistic
follows the t-distribution with n − 2 degrees of freedom.
3. The two-tailed critical values at the 5% level are $t_{n-2,\,\alpha/2} = -2.306$ and
$t_{n-2,\,1-\alpha/2} = 2.306$.
4. Compute the test statistic using

$$ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad \text{hence} \qquad T = \frac{0.2412\sqrt{10-2}}{\sqrt{1-0.2412^{2}}} = 0.7030 $$
5. Decision: Do not reject H₀ because |T| = 0.7030 < 2.306, and conclude that there is no
significant linear association between the variables.
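If scipy is available, Example 3.1 can be verified in a few lines; scipy.stats.pearsonr returns both the coefficient and the two-sided p-value of the t-test carried out above, so the decision in step 5 can be read off by comparing the p-value with 0.05.

```python
from scipy import stats

math_scores = [5, 0, 3, 1, 2, 2, 5, 3, 5, 4]
english_scores = [1, 2, 1, 3, 3, 1, 3, 1, 6, 2]

r, p = stats.pearsonr(math_scores, english_scores)
print(f"r = {r:.4f}, p-value = {p:.4f}")
# r ≈ 0.2412 and p > 0.05, so H0 (no linear association) is not rejected.
```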
Example 3.2
Find the rank correlation coefficient for the data in the table below:

x: 69  66  68  73  71  74  71  69
y: 163 153 185 186 157 220 190 185
Solution.
Let the rank values of x and y be Rx and Ry respectively.
x   | y   | Rx  | Ry  | d = Rx − Ry | d²
69  | 163 | 3.5 | 3   | 0.5         | 0.25
66  | 153 | 1   | 1   | 0           | 0
68  | 185 | 2   | 4.5 | −2.5        | 6.25
73  | 186 | 7   | 6   | 1           | 1
71  | 157 | 5.5 | 2   | 3.5         | 12.25
74  | 220 | 8   | 8   | 0           | 0
71  | 190 | 5.5 | 7   | −1.5        | 2.25
69  | 185 | 3.5 | 4.5 | −1          | 1
$$ \sum_{i=1}^{n} d_i^{2} = 0.25 + 0 + 6.25 + 1 + 12.25 + 0 + 2.25 + 1 = 23 $$
Hence, the rank correlation coefficient is given by:

$$ r = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2}-1)} = 1 - \frac{6(23)}{8(64-1)} = 0.7262 $$
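The hand calculation can be checked with scipy's rankdata, which averages tied ranks exactly as in the table above. Note that scipy.stats.spearmanr instead computes a Pearson correlation on the ranks, so with ties it can differ slightly from the d² shortcut used in this chapter.

```python
import numpy as np
from scipy.stats import rankdata

x = [69, 66, 68, 73, 71, 74, 71, 69]
y = [163, 153, 185, 186, 157, 220, 190, 185]

rx, ry = rankdata(x), rankdata(y)      # average ranks for ties
d2 = float(np.sum((rx - ry) ** 2))     # sum of squared rank differences
n = len(x)
r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, r_s)                         # 23.0 and 0.7262, matching the hand calculation
```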
Exercise 3.1
(a) Consider the Table of grades below:
Mathematics grade 70 92 80 74 65 83
English grade 74 84 63 87 78 90
(i) Compute and interpret the correlation coefficient if the grades of the students are
selected at random.
(ii) Test the hypothesis that there is no linear association between the variables at the 5%
level of significance.
(b) The Statistics Consulting Center at Virginia Tech analyzed data for the Department of
Veterinary Medicine. The variables of interest were body weight in kilograms and chest
size in centimetres. It was desired to develop a linear regression equation in order to
determine if there is a significant linear relationship between chest size and total body
weight.

Weight (kg):     2.75  2.15  4.41  5.52  3.21  4.32  2.31  4.30  3.71
Chest size (cm): 29.5  26.3  32.2  36.5  27.2  27.7  28.3  30.3  28.7
(i) Calculate r and interpret it.
(ii) Test the null hypothesis that ρ = 0 against the alternative that ρ > 0. Use the α = 0.01 level.
(iii) What percentage of the variation in chest sizes is explained by differences in
weight? {NB: square r to obtain this percentage.}
(iv) Use the Spearman rank approach to compute r for the data in part (a).
CHAPTER 4
Regression Analysis
4.1 Regression Analysis
Regression analysis is a statistical technique in which we use observed data to relate a
variable of interest (the response or dependent variable) to one or more independent (or
predictor) variables. A regression analysis in which the response variable depends on one
independent or predictor variable is called a simple regression model. The main objective
of regression analysis is to build a regression model or prediction equation that can be
used to describe, predict, and control the dependent variable on the basis of the
independent variables (Bowerman et al., 2001).
4.2 Scatter Diagram
One way to explore the relationship between a dependent variable y and an independent
variable (denoted x ) is to make a scatter diagram, or scatter plot, of y versus x . First,
data concerning the two variables are observed in pairs. To construct the scatter plot, each
value of y is plotted against its corresponding value of x .If y and x are related, the plot
shows us the direction of the relationship. That is, y could be positively related to x ( y
increases as x increases) or y could be negatively related to x ( y decreases as
x increases). (Bowerman et al, 2001). The figures below show some examples of scatter
plots.
[Figure 1(a)–1(d): examples of scatter plots.]
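A scatter plot of this kind is straightforward to produce with matplotlib. The sketch below uses simulated data (an assumed positive linear trend plus random noise) purely to illustrate the idea; none of these values come from the text.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 0.8 * x + rng.normal(0, 1.5, size=x.size)  # assumed positive trend plus noise

plt.scatter(x, y)
plt.xlabel("x (independent variable)")
plt.ylabel("y (dependent variable)")
plt.title("y appears positively related to x")
plt.show()
```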
4.3 Simple linear regression model
The simple linear regression model assumes that the relationship between the dependent
variable, denoted y, and the independent variable, denoted x, can be approximated by a
straight line. We can tentatively decide whether there is an approximate straight-line
relationship between y and x by making a scatter diagram, or scatter plot, of y versus x.
The simple linear regression model is given by:

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$

where
β₀ = the intercept of the model on the y-axis;
β₁ = the slope of the linear model;
yᵢ = the value of the response variable for the i-th observation;
xᵢ = the known value of the predictor variable for the i-th observation;
εᵢ = the random error or noise term, which accounts for errors due to chance and
neglected factors assumed not to be important.
4.4 Assumptions Underlying the Simple Linear Model
(i) The values of yᵢ are random and independent of each other.
(ii) The error term is normally distributed: εᵢ ~ N(0, σ²).
(iii) The mean of the error term is zero, i.e. E(εᵢ) = 0.
(iv) The variance of the error term is constant, i.e. Var(εᵢ) = σ² for all xᵢ.
(v) The random errors εᵢ and εⱼ are independent, i.e. cov(εᵢ, εⱼ) = 0 for i ≠ j.
From the above assumptions, we establish the following:

$$ E(y_i) = E(\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 + \beta_1 x_i $$
$$ \mathrm{Var}(y_i) = \mathrm{Var}(\beta_0 + \beta_1 x_i + \varepsilon_i) = \mathrm{Var}(\varepsilon_i) = \sigma^{2} \ \text{for all } x_i $$

(a) yᵢ ~ N(β₀ + β₁xᵢ, σ²)
(b) cov(yᵢ, yⱼ) = 0 for i ≠ j
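These properties are easy to see in a small simulation. The sketch below draws errors from N(0, σ²) with assumed parameter values (β₀ = 2, β₁ = 0.5, σ = 1 are my own choices for illustration) and confirms that the deviations of yᵢ about the true line have mean ≈ 0 and variance ≈ σ².

```python
import numpy as np

# Assumed parameter values for the simulation sketch.
beta0, beta1, sigma = 2.0, 0.5, 1.0
rng = np.random.default_rng(1)

x = np.linspace(0, 10, 100_000)
eps = rng.normal(0, sigma, size=x.size)  # eps_i ~ N(0, sigma^2), independent
y = beta0 + beta1 * x + eps              # y_i = beta0 + beta1 * x_i + eps_i

dev = y - (beta0 + beta1 * x)            # deviations of y_i from E(y_i)
print(dev.mean(), dev.var())             # ≈ 0 and ≈ sigma^2 = 1
```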
4.5 The Least Squares Point Estimates of the Linear Regression Model
We seek the estimates β̂₀ and β̂₁ that minimize the total sum of squared errors (SSE):

$$ SSE = \sum_{i=1}^{n} \varepsilon_i^{2} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2} = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^{2} \qquad (1) $$

Differentiating the SSE with respect to β̂₀ and β̂₁, we have

$$ \frac{\partial(SSE)}{\partial\hat{\beta}_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \qquad (2) $$
$$ \frac{\partial(SSE)}{\partial\hat{\beta}_1} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)\,x_i $$
By setting the partial derivatives to zero and rearranging the terms, we obtain the normal
equations:

$$ n\hat{\beta}_0 + \hat{\beta}_1\sum_{i=1}^{n}x_i = \sum_{i=1}^{n}y_i $$
$$ \hat{\beta}_0\sum_{i=1}^{n}x_i + \hat{\beta}_1\sum_{i=1}^{n}x_i^{2} = \sum_{i=1}^{n}x_i y_i \qquad (3) $$

which may be solved simultaneously to yield:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{\sum_{i=1}^{n}x_i^{2} - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^{2}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}} = \frac{SS_{xy}}{SS_{xx}} \qquad (4) $$

and

$$ \hat{\beta}_0 = \frac{\sum_{i=1}^{n}y_i - \hat{\beta}_1\sum_{i=1}^{n}x_i}{n} = \bar{y} - \hat{\beta}_1\bar{x} \qquad (5) $$
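Equations (4) and (5) translate directly into code. The sketch below computes the least squares point estimates with numpy; the function name is my own illustrative choice.

```python
import numpy as np

def least_squares(x, y):
    """Least squares point estimates from equations (4) and (5)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
    ss_xx = np.sum((x - x.mean()) ** 2)
    b1 = ss_xy / ss_xx               # slope, equation (4)
    b0 = y.mean() - b1 * x.mean()    # intercept, equation (5)
    return b0, b1
```

For example, least_squares([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0]) returns estimates close to (0.05, 2.0), recovering the roughly y = 2x pattern in that toy data.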
4.6 Analysis of Variance
The application of analysis of variance (ANOVA) in regression analysis is based on
partitioning the total variation, and its degrees of freedom, into components. Partitioning
the total variation,

$$ \sum_{i=1}^{n}(y_i-\bar{y})^{2} = \sum_{i=1}^{n}\left[(y_i-\hat{y}_i) + (\hat{y}_i-\bar{y})\right]^{2} $$
$$ = \sum_{i=1}^{n}(y_i-\hat{y}_i)^{2} + \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^{2} + 2\sum_{i=1}^{n}(y_i-\hat{y}_i)(\hat{y}_i-\bar{y}) $$

The cross-product term vanishes for the least squares fit, so

$$ \sum_{i=1}^{n}(y_i-\bar{y})^{2} = \sum_{i=1}^{n}(y_i-\hat{y}_i)^{2} + \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^{2} $$

i.e. Total variation = Unexplained variation + Explained variation, or

$$ SST = SSE + SSR $$
• The SST measures the total variation in the observed values yᵢ.
• The SSR measures the amount of the total variation in the observed values of yᵢ that is
accounted for by the model.
• The SSE measures the dispersion of the observed values yᵢ about the regression line.
It can be shown that

$$ r^{2} = \frac{SSR}{SST} $$

This r² is called the coefficient of determination, which is the explained variation
expressed as a fraction of the total variation. It gives the proportion of the total variation
in yᵢ accounted for by the model. We can deduce from r² = SSR/SST that
i) SSR = r²·SST
ii) SSE = (1 − r²)·SST
Also 0 ≤ r² ≤ 1, and

$$ r^{2} = \hat{\beta}_1^{2}\,\frac{SS_{xx}}{SS_{yy}} = \hat{\beta}_1\,\frac{SS_{xy}}{SS_{yy}} $$
Table 4.1 The analysis of variance (ANOVA) for the regression model.

Source of Variation | Sum of Squares          | Degrees of Freedom | Mean Square       | F-Ratio
Regression          | SSR = β̂₁·SS_xy         | 1                  | MSR = SSR/1       | F = MSR/MSE
Residual (Error)    | SSE = SS_yy − β̂₁·SS_xy | n − 2              | MSE = SSE/(n − 2) |
Total               | SST = SS_yy             | n − 1              |                   |
From the ANOVA table, the following results can be established:
i) E(SST) = (n − 1)σ² + β₁²·SS_xx
ii) E(MSR) = σ² + β₁²·SS_xx
iii) E(SSE) = (n − 2)σ², so that E(MSE) = σ²
iv) Under the null hypothesis H₀: β₁ = 0: E(MST) = σ², E(MSR) = σ², and E(MSE) = σ².
Note that MSE is an unbiased estimator of σ² whether or not x and y are related (i.e.
whether β₁ = 0 or not). If β₁ ≠ 0, then E(MSR) > σ², since β₁²·SS_xx > 0. Thus, for
testing whether or not β₁ = 0, a comparison of MSR and MSE is made. If MSR and
MSE are of the same order of magnitude, this suggests that β₁ = 0; on the other hand, if
MSR is much larger than MSE, this suggests that β₁ ≠ 0.
These two mean squares (MSR and MSE) form the basic idea underlying the ANOVA
test of the overall regression model.
4.7 The F-Ratio Test
The ANOVA provides a highly useful test for regression models (and other linear
statistical models).
(a) The hypotheses are H₀: β₁ = 0 versus H₁: β₁ ≠ 0 at the α level of significance.
(b) The test statistic is

$$ F = \frac{MSR}{MSE} \sim F(1,\, n-2) $$

which is close to 1 when β₁ = 0 and larger than 1 when β₁ ≠ 0.
(c) Decision rule: Reject H₀: β₁ = 0 if F > F_α(1, n − 2).
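The computations of Table 4.1 and the decision rule above can be wrapped in one small function. This is a sketch under the formulas of this section, using scipy only to look up the F critical value; the function name is my own.

```python
import numpy as np
from scipy import stats

def anova_f_test(x, y, alpha=0.05):
    """ANOVA F-test of H0: beta1 = 0 for simple linear regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
    ss_xx = np.sum((x - x.mean()) ** 2)
    ss_yy = np.sum((y - y.mean()) ** 2)        # SST
    ssr = (ss_xy / ss_xx) * ss_xy              # SSR = b1 * SS_xy
    sse = ss_yy - ssr                          # SSE = SS_yy - b1 * SS_xy
    f = (ssr / 1) / (sse / (n - 2))            # F = MSR / MSE
    f_crit = stats.f.ppf(1 - alpha, 1, n - 2)  # F_alpha(1, n - 2)
    return f, f_crit, f > f_crit               # reject H0 if F > F_alpha(1, n - 2)
```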
4.8 Coefficient of determination
The coefficient of determination, denoted R², measures the proportion of the total
variability in the dependent variable (y) that is explained by the independent variable (x):

$$ R^{2} = \frac{SSR}{SSTO} = \frac{SSTO - SSE}{SSTO} = 1 - \frac{SSE}{SSTO} $$

Note the following:
(a) 0 ≤ R² ≤ 1.
(b) If all the data points fall exactly on a regression line having a non-zero slope,
then R² = 1.
(c) If β̂₁ = 0, then R² = 0.
(d) The square root of R² gives the correlation coefficient, where the sign is determined
by the sign of β̂₁:

$$ r = \begin{cases} +\sqrt{R^{2}} & \text{if } \hat{\beta}_1 > 0 \\[4pt] -\sqrt{R^{2}} & \text{if } \hat{\beta}_1 < 0 \end{cases} $$
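Point (d) is a one-liner in code: recover r from R² and attach the sign of the slope. The helper below is purely illustrative.

```python
import math

def r_from_r2(r2, b1):
    """Correlation coefficient from R^2, signed by the slope estimate b1."""
    return math.copysign(math.sqrt(r2), b1)

print(r_from_r2(0.90, 0.003585))   # +0.9487 ≈ +0.95, since the slope is positive
```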
4.9 Pitfalls and limitations associated with regression and correlation analysis
(a) In regression analysis a value of Y cannot be legitimately estimated if the value of
X is outside the range of values that served as the basis for the regression
equation.
(b) If the estimate of Y involves the prediction of a result that has not yet occurred,
the historical data that served as the basis for the regression equation may not be
relevant for future events.
(c) The use of a prediction or a confidence interval is based on the assumption that
the conditional distributions of Y, and thus of the residuals, are normal and have
equal variances.
(d) A significant correlation coefficient does not necessarily indicate causation, but
rather may indicate a common linkage to other events.
(e) A significant correlation is not necessarily an important correlation. Given a large
sample, a correlation of, say, r = +0.10 can be significantly different from 0 at
α = 0.05. Yet the coefficient of determination of r² = 0.01 for this example
indicates that only 1 percent of the variance in Y is statistically explained by
knowing X.
(f) The interpretation of the coefficients of correlation and determination is based on
the assumption of a bivariate normal distribution for the population and, for each
variable, equal conditional variances.
(g) For both regression and correlation analysis, a linear model is assumed. For a
relationship that is curvilinear, a transformation to achieve linearity may be
available. Another possibility is to restrict the analysis to the range of values
within which the relationship is essentially linear.
4.10 Worked Examples
Example 4.1
Use the estimated model below to describe the relationship between x and y:

$$ \hat{y} = 3.829633 - 0.903643x $$

Since the slope is negative, x and y are inversely related: each unit increase in x is
associated with a decrease of about 0.9036 in the predicted value of y.
Example 4.2
Suppose an analyst takes a random sample of 10 recent truck shipments made by a
company and records the distance in miles and the delivery time, to the nearest half-day,
from the time that the shipment was made available for pick-up. Use the shipment data
to answer the following questions.
(a) Construct a scatter plot and use it to determine whether a linear regression model
will be appropriate.
(b) Determine the least-squares regression equation for the data.
(c) What is the nature of the relationship between distance and delivery time?
Motivate your answer.
(d) Interpret the estimated value of β₁.
(e) Determine the number of days it will take a shipment to arrive if the total distance is
1,000 miles.
(f) Determine the coefficient of determination and interpret it.
(g) Construct an ANOVA table to represent the data.
(h) Perform a hypothesis test to determine the significance of the estimated model at
α = 0.05.
(i) Compute the correlation coefficient and use it to test the hypothesis H₀: ρ = 0
versus H₁: ρ ≠ 0 at the 0.05 level of significance.
Solution
(a) Looking at the scatter plot of the data, the points appear to fall along a straight line,
so a linear regression model may be an appropriate model for the data.
(b) We need the following summary statistics:

$$ \bar{x} = 762, \quad \bar{y} = 2.85, \quad \sum_{i=1}^{10}(x_i-\bar{x})^{2} = 1297860, \quad \sum_{i=1}^{10}(y_i-\bar{y})^{2} = 18.525, \quad \sum_{i=1}^{10}(x_i-\bar{x})(y_i-\bar{y}) = 4653 $$
$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}} = \frac{4653}{1297860} = 0.003585 $$
$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 2.85 - 0.003585 \times 762 = 0.118129 $$

Therefore, the estimated regression line is ŷ = 0.118129 + 0.003585x.
(c) There is a direct (positive) relationship between delivery time and distance, because
β̂₁ is positive.
(d) All other things being equal, a one-mile increase in distance results in an increase of
about 0.0036 days in delivery time.
(e) ŷ = 0.118129 + 0.003585(1000) = 3.70 days.
(f) SSE = SS_yy − β̂₁·SS_xy = 18.525 − (0.003585)(4653) = 1.844.

Note from the variance decomposition that SST = Σ(yᵢ − ȳ)² = 18.525, so

$$ r^{2} = 1 - \frac{SSE}{SST} = 1 - \frac{1.844}{18.525} = 0.90 $$

About 90% of the total variation in delivery time is explained by the distance.
(g) ANOVA table

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square              | F-Ratio
Regression          | 16.681         | 1                  | 16.681                   | F = 16.681/0.2305 = 72.396
Residual (Error)    | 1.844          | 8                  | 1.844/(10 − 2) = 0.2305  |
Total               | 18.525         | 9                  |                          |
(h) H₀: β₁ = 0 versus H₁: β₁ ≠ 0 at the 0.05 level of significance.
F = 72.396 and F_α(1, n − 2) = F₀.₀₅(1, 8) = 5.32.
Since F = 72.396 > 5.32, we reject H₀: β₁ = 0 and conclude that the model is significant.
(i) H₀: ρ = 0 versus H₁: ρ ≠ 0.
The two-tailed critical values at the 5% level are $t_{n-2,\,\alpha/2} = -2.306$ and
$t_{n-2,\,1-\alpha/2} = 2.306$. Since β̂₁ is positive, r = +√0.90 = 0.95.

$$ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{0.95\sqrt{10-2}}{\sqrt{1-0.95^{2}}} = 8.61 $$

We reject H₀ because T = 8.61 > 2.306 and conclude that there is a significant linear
relationship between distance and delivery time.
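All of the numbers in this example can be reproduced from the summary statistics alone. The sketch below hard-codes those summaries (the raw shipment table is not reproduced in these notes) and recomputes each quantity; note that carrying full precision gives T ≈ 8.51, while the 8.61 above comes from rounding r to 0.95 first.

```python
import math
from scipy import stats

# Summary statistics given in Example 4.2.
n, xbar, ybar = 10, 762.0, 2.85
ss_xx, ss_yy, ss_xy = 1297860.0, 18.525, 4653.0

b1 = ss_xy / ss_xx                                # 0.003585
b0 = ybar - b1 * xbar                             # 0.118129
sse = ss_yy - b1 * ss_xy                          # 1.844
r2 = 1 - sse / ss_yy                              # 0.90
f = (ss_yy - sse) / (sse / (n - 2))               # F ≈ 72.4
r = math.copysign(math.sqrt(r2), b1)              # ≈ +0.95
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)   # ≈ 8.5 (8.61 with r rounded to 0.95)
t_crit = stats.t.ppf(0.975, n - 2)                # 2.306
print(b0, b1, r2, f, t, t_crit)
```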