Statistics for Data Science -1
Lecture 4.7: Association between two numerical
variables-Correlation
Usha Mohan
Indian Institute of Technology Madras
Association between numerical variables
Measuring association: Correlation
Learning objectives
1. Understand the measure of correlation.
2. Interpret correlation to quantify the strength of association
between two numerical variables.
Correlation
- Correlation is a more easily interpreted measure of linear association between
  two numerical variables than covariance.
- It is derived from covariance.
- To find the correlation between two numerical variables x and y, divide the
  covariance between x and y by the product of the standard deviations of x
  and y. The Pearson correlation coefficient, r, between x and y is given by
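\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{\operatorname{cov}(x, y)}{s_x s_y}
\]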
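The formula can be turned into a few lines of code. The following is a minimal sketch in plain Python, not part of the lecture; the function name `pearson_r` and the error handling are illustrative assumptions. It uses the first form of the formula (sums of deviations), so no choice of divisor for the covariance is needed.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Hypothetical helper written for illustration only.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Numerator: sum of products of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Denominator: square root of the product of the sums of squared deviations.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when a variable has zero spread")

    return s_xy / sqrt(s_xx * s_yy)
```

Calling `pearson_r(x, y)` and `pearson_r(y, x)` gives the same value, because the formula is symmetric in x and y.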
Remark
The units of the standard deviations cancel the units of the covariance, so
the correlation has no units.
Remark
It can be shown that the correlation always lies between -1 and +1.
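Both remarks can be checked numerically. The sketch below uses the age and height values from Example 1 of this lecture; the units (years and centimetres) and the helper names are assumptions made for illustration. Changing the height unit rescales the covariance but leaves the correlation unchanged.

```python
from math import sqrt


def sample_cov(x, y):
    """Sample covariance with divisor n - 1 (illustrative helper)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def sample_sd(x):
    """Sample standard deviation, the square root of cov(x, x)."""
    return sqrt(sample_cov(x, x))


def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))


age_years = [1, 2, 3, 4, 5]              # assumed unit: years
height_cm = [75, 85, 94, 101, 108]       # assumed unit: centimetres
height_m = [h / 100 for h in height_cm]  # same heights in metres

cov_cm = sample_cov(age_years, height_cm)  # units: years * cm
cov_m = sample_cov(age_years, height_m)    # units: years * m
r_cm = correlation(age_years, height_cm)   # no units
r_m = correlation(age_years, height_m)     # no units

print(f"covariance:  {cov_cm:.4f} (cm) vs {cov_m:.4f} (m)")
print(f"correlation: {r_cm:.4f} (cm) vs {r_m:.4f} (m)")
```

The covariance shrinks by a factor of 100 when heights are expressed in metres, while the correlation is the same in both cases and stays inside the interval from -1 to +1.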
Correlation: Example 1
Age (x)   Height (y)   (x_i − x̄)²   (y_i − ȳ)²   (x_i − x̄)(y_i − ȳ)
1         75           4            309.76       35.2
2         85           1            57.76        7.6
3         94           0            1.96         0
4         101          1            70.56        8.4
5         108          4            237.16       30.8
Total                  10           677.2        82
- $s_x = 1.58$, $s_y = 13.01$
- $r = \dfrac{82}{\sqrt{10 \times 677.2}}$ or, equivalently, $r = \dfrac{20.5}{1.58 \times 13.01}$, which gives $r = 0.9964$
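The arithmetic of this example can be reproduced step by step. The sketch below is one possible way to rebuild the table columns and then compute r by both routes; the variable names are illustrative, and the divisor n − 1 for the covariance and standard deviations is inferred from the values 20.5, 1.58 and 13.01 shown above.

```python
from math import sqrt

ages = [1, 2, 3, 4, 5]
heights = [75, 85, 94, 101, 108]
n = len(ages)

x_bar = sum(ages) / n
y_bar = sum(heights) / n

# Table columns: squared deviations and products of deviations.
sq_dev_x = [(x - x_bar) ** 2 for x in ages]
sq_dev_y = [(y - y_bar) ** 2 for y in heights]
cross = [(x - x_bar) * (y - y_bar) for x, y in zip(ages, heights)]

# Column totals (the last row of the table).
sum_xx, sum_yy, sum_xy = sum(sq_dev_x), sum(sq_dev_y), sum(cross)

# Route 1: use the column totals directly.
r_from_sums = sum_xy / sqrt(sum_xx * sum_yy)

# Route 2: sample covariance over the product of sample standard deviations.
cov_xy = sum_xy / (n - 1)
s_x = sqrt(sum_xx / (n - 1))
s_y = sqrt(sum_yy / (n - 1))
r_from_cov = cov_xy / (s_x * s_y)

print(f"totals: {sum_xx:.2f}, {sum_yy:.2f}, {sum_xy:.2f}")
print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}, cov = {cov_xy:.2f}")
print(f"r (sums) = {r_from_sums:.4f}, r (cov / sd) = {r_from_cov:.4f}")
```

The two routes agree because the factor n − 1 appears once in the covariance and once in the product of the two standard deviations, so it cancels.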
Correlation: Example 2
Age (x)   Price (y)   (x_i − x̄)²   (y_i − ȳ)²   (x_i − x̄)(y_i − ȳ)
1         6           4            4            −4
2         5           1            1            −1
3         4           0            0            0
4         3           1            1            −1
5         2           4            4            −4
Total                 10           10           −10
- $s_x = 1.58$, $s_y = 1.58$
- $r = \dfrac{-10}{\sqrt{10} \times \sqrt{10}}$ or, equivalently, $r = \dfrac{-2.5}{1.58 \times 1.58}$, which gives $r = -1$
Correlation using Google Sheets
Step 1 The function CORREL(series1, series2) returns the value of the
correlation.
For example, if the data for the x-variable (series1) is in cells A2:A6 and
the data for the y-variable (series2) is in cells B2:B6, then entering
=CORREL(A2:A6, B2:B6) in a cell returns the value of the Pearson correlation
coefficient.
Section summary
1. Introduced correlation as a measure of linear association.
2. Interpreted the correlation between two numerical variables.
Fitting a line
Learning objectives
1. Summarize the linear association between two variables using
the equation of a line.
2. Understand the significance of R².
Summarizing the association with a line
- The strength of the linear association between the variables was measured
  using covariance and correlation.
- The linear association can be described using the equation of a line.
Equation of a line using Google Sheets
Step 1 Open the scatter plot.
Step 2 Under the Customize tab, click Series.
Step 3 Click Trendline.
Step 4 Under Label, choose Use Equation, and select the Show R² option.
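The equation and R² that the trendline reports can also be computed directly. The sketch below assumes the trendline is an ordinary least-squares line, which is a standard choice but is not stated in the lecture; the function name `fit_line` is illustrative. It is applied to the age and height values from Example 1 of the correlation section, since the home and car data sets are not listed here.

```python
def fit_line(x, y):
    """Least-squares slope, intercept and R^2 for y = slope * x + intercept.

    Hypothetical helper written for illustration only.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot

    return slope, intercept, r_squared


ages = [1, 2, 3, 4, 5]
heights = [75, 85, 94, 101, 108]

slope, intercept, r_squared = fit_line(ages, heights)
print(f"Height = {slope:.2f} * Age + {intercept:.2f}; R^2 = {r_squared:.4f}")
```

The slope is the change in the fitted y for a one-unit increase in x, and the intercept is the fitted y at x = 0.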
Example 1: Size versus Price of homes: Equation
Equation of the line: Price = 30.5 × Size + 36;
R² = 0.647; r = 0.804
Example 2: Age versus Price of cars: Equation
Equation of the line: Price = −0.694 × Age + 9.03;
R² = 0.855; r = −0.9247
Example 3: Size versus Price of homes: Equation
Equation of the line: Price = 7.77 × Size + 130;
R² = 0.022; r = 0.149
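For a least-squares line with an intercept, R² equals the square of the correlation r, and the slope has the same sign as r. The short check below, which is not part of the lecture, applies this to the values reported in the three examples of this section; small differences in the third decimal place come from rounding in the reported figures.

```python
# (slope, R^2, r) as reported for the three examples in this section.
reported = {
    "Example 1: size vs price of homes": (30.5, 0.647, 0.804),
    "Example 2: age vs price of cars": (-0.694, 0.855, -0.9247),
    "Example 3: size vs price of homes": (7.77, 0.022, 0.149),
}

checks = []
for name, (slope, r_squared, r) in reported.items():
    same_sign = (slope > 0) == (r > 0)
    checks.append((abs(r * r - r_squared), same_sign))
    print(f"{name}: r^2 = {r * r:.3f}, reported R^2 = {r_squared}, "
          f"slope and r share a sign: {same_sign}")
```

Read this way, Example 3 has a positive slope but a line that explains only about 2% of the variation in price, while the line in Example 2 explains about 86%.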
Section summary
1. Equation of a line describing the linear relationship between two
variables.
2. Interpreting the slope and R² of the line.