Correlation : introduction
- A frequent need in data analysis is to study the extent to which two quantities are related, e.g. whether they tend to change in the same direction
- Two types of relationships: linear and non-linear
© Nicolas Navet University of Luxembourg 3
Correlation : identifying it
- Visual exploration makes it possible to detect both linear and non-linear correlations
- There are statistical tests/measures of correlation:
- The Pearson Correlation Coefficient (PCC) is the standard measure of linear correlation, but it
does not capture non-linear relationships
- Spearman's rank correlation coefficient is an alternative to the PCC that is sensitive to non-linear
relationships, but it only captures how well the relationship between two variables can be
described by a monotonic function (so more general than a linear relationship, but not an
arbitrary relationship)
- The BDS (Brock-Dechert-Scheinkman) test is a statistical test designed to detect non-linear
dependence in time series data (out of scope) – “The BDS test enables us to reject the null
hypothesis that price changes are i.i.d.” (just a heads-up!)
- “Correlation does not imply causation”: two variables moving together does not
necessarily mean one causes the other. There may be hidden factors. Ex: the
rate of violent crime has been known to increase when ice cream sales do
Visual exploration of the relationship
between two variables
- Predictor: a predictor variable (aka independent variable) is used to predict the
value of another variable (the outcome or dependent variable). Is height a predictor
of weight?
- Let’s use the (synthetic!) data set at
https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset
Data visualization
Visual exploration of the correlation with a
scatterplot
- X is the value of the predictor, here the height
- Y is the value of the second variable (the outcome or dependent variable), the
weight
Is there an apparent correlation between height and weight?
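A minimal sketch of such a scatterplot in Python. The data below is synthetic (generated from a noisy linear relation), not the Kaggle data set:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders to file
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 500)                          # synthetic predictor
weight_kg = 0.55 * height_cm - 37.46 + rng.normal(0, 8, 500)  # synthetic outcome

fig, ax = plt.subplots()
ax.scatter(height_cm, weight_kg, s=10, alpha=0.5)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Weight vs Height")
fig.savefig("scatter_height_weight.png")
```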
Linear Regression: Weight vs Height
- A regression model predicts / explains a dependent variable based on
independent variables
- Weight (kg)=0.55×Height (cm)−37.46
- For each additional centimeter in height,
the model predicts an increase of
approximately 0.55 kg in weight
- Computed using the LinearRegression model from the scikit-learn library in
Python, using Ordinary Least Squares (OLS) → finds the best-fitting line by
minimizing the sum of squared differences between actual weight values and
predicted weight values
- OLS is sensitive to outliers and not always the best approach
Quality of regression depends on the difference between the observed and
predicted values → R-squared value, here 0.25: “25% of the variance in weight
can be explained by the height”
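The OLS fit described above can be sketched as follows. The data here is synthetic, generated from the slide's fitted line plus noise, so the recovered coefficients are only illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
height_cm = rng.normal(170, 10, 1000).reshape(-1, 1)  # sklearn expects a 2-D X
weight_kg = 0.55 * height_cm.ravel() - 37.46 + rng.normal(0, 5, 1000)

model = LinearRegression()  # fits by Ordinary Least Squares
model.fit(height_cm, weight_kg)

slope = model.coef_[0]      # kg gained per extra cm, close to 0.55 here
intercept = model.intercept_
r_squared = model.score(height_cm, weight_kg)  # share of variance explained
print(slope, intercept, r_squared)
```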
“Hexagonal binning” plot for large data sets
- Similar to a scatterplot, but the color intensity indicates the density of data points in each hexagon.
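matplotlib provides hexbin for exactly this; a minimal sketch on a large synthetic sample:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(170, 10, 100_000)                   # large synthetic sample
y = 0.55 * x - 37.46 + rng.normal(0, 8, 100_000)

fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=40, cmap="viridis")  # color = points per hexagon
fig.colorbar(hb, ax=ax, label="count")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
fig.savefig("hexbin_height_weight.png")
```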
Correlation : introduction
https://xkcd.com/552/
Why does the person on the
right answer “Maybe” ?
Covariance between two variables
- Covariance measures the direction of the linear
relationship between two variables:
- A positive covariance indicates that the variables
tend to move in the same direction, while a negative
covariance shows that they move in opposite
directions (i.e., as one variable increases, the other
tends to decrease)
- Covariance alone does not give a clear measure of
the strength of this relationship since its value
depends on the scales of the variables
- Pearson’s correlation coefficient removes the effects of
the scale of the variables, and thus makes it easier to
interpret the strength and direction of the relationship
Exercise: calculate the covariance of a={1,2,3,4,5} and b1={6,7,8,9,10},
then of a and b2={-6,7,-8,9,-10}. Hint: use np.cov() in Python
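A possible solution to the exercise with np.cov(). Note that np.cov returns a 2×2 covariance matrix; the covariance between the two inputs is the off-diagonal entry:

```python
import numpy as np

a  = np.array([1, 2, 3, 4, 5])
b1 = np.array([6, 7, 8, 9, 10])     # a shifted by 5: moves exactly with a
b2 = np.array([-6, 7, -8, 9, -10])  # alternating signs: no consistent direction

cov_ab1 = np.cov(a, b1)[0, 1]  # off-diagonal entry = Cov(a, b1)
cov_ab2 = np.cov(a, b2)[0, 1]

print(cov_ab1)  # 2.5  (positive: the variables move in the same direction)
print(cov_ab2)  # -1.5 (slightly negative)
```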
Covariance VS Variance
- Variance and its square root, the Standard Deviation (StdDev), measure how much a single
variable deviates from its mean. They provide a measure of the dispersion. The unit of
StdDev is the unit of the variable itself.
- Covariance measures how two variables move together. It shows the directional
relationship between the two variables. Covariance mixes two units and is
sensitive to the scales of the units.
- Nb: Cov(X,Y) = Cov(Y,X), therefore PCC(X,Y) = PCC(Y,X) (just a heads-up!)
Pearson Correlation Coefficient (PCC)
- Most popular measure of the
linear correlation between two variables
- r = Cov(X, Y) / (σ_X σ_Y)
- Cauchy-Schwarz inequality: |Cov(X, Y)| ≤ σ_X σ_Y
- The denominator normalizes the covariance
by the product of the standard deviations of
X and Y and ensures that the correlation
coefficient r ranges between −1 and 1.
- Direction of correlation:
- Positive: as one variable increases the other also increases. Example: the PCC
of the height and weight data set used earlier is 0.5, which indicates a
moderate positive correlation
- Negative: as one variable increases, the other decreases.
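The normalization described above can be checked numerically; a sketch on synthetic data comparing the manual formula with np.corrcoef():

```python
import numpy as np

rng = np.random.default_rng(7)
height = rng.normal(170, 10, 2000)                        # synthetic heights (cm)
weight = 0.55 * height - 37.46 + rng.normal(0, 9, 2000)   # noisy linear relation

# Manual formula: r = Cov(X, Y) / (sigma_X * sigma_Y)
r_manual = np.cov(height, weight)[0, 1] / (np.std(height, ddof=1) * np.std(weight, ddof=1))
# Library version
r_numpy = np.corrcoef(height, weight)[0, 1]

print(r_manual, r_numpy)  # identical values, roughly 0.5 for this noise level
```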
PCC : Strength of the correlation
Practice exercise
Use Python to calculate the Pearson correlation coefficient between variables
X and Y. Is there a significant correlation between age and glucose levels?
Hint: use np.corrcoef()
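A possible solution sketch; the age and glucose values below are hypothetical sample data, not the data set from the exercise:

```python
import numpy as np

# Hypothetical sample values -- replace with the exercise's actual data
age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

r = np.corrcoef(age, glucose)[0, 1]
print(r)  # about 0.53 for these made-up values: a moderate positive correlation
```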
Pearson Correlation coefficient (PCC)
- PCC value is always between -1 (total negative correlation) and 1 (total
positive correlation). A value of 0 indicates no linear correlation.
Important: the PCC
reveals the existence and
direction of a linear
relationship but not the
slope of that relationship.
The latter can be derived
by fitting a linear model to
the data.
Figure from Wikipedia
Estimate the PCC
Figure from Wikipedia
Estimate the PCC
Figure from Wikipedia Relationships in the third row are not linear!
Limitations of the PCC
All 3 sets of data have the same PCC of 0.816
Figures from Wikipedia: a perfect linear relationship except for one outlier;
an obvious non-linear relationship
✓ The PCC is a summary statistic and cannot replace visual examination of the data.
✓ Prior to evaluating the PCC, we should look at a scatterplot to ensure that it
makes sense to proceed.
The PCC only tells the extent to which a relationship can be approximated by a
linear relationship
Correlation is not Causation
- Data shows that when ice cream sales
increase, drowning incidents also increase.
- This does not mean that ice cream
consumption causes more drowning
incidents!
- There is a hidden factor: hot weather
- Other techniques are needed to
demonstrate causation such as
- Randomized Controlled Trials: the
‘treatment group’ receives the treatment
while the ‘control group’ does not
- Controlled experiments: adjusting one
variable, e.g. temperature, while keeping
all others constant
Correlation Matrix and heatmap
- A table that shows the correlation coefficients between multiple variables
- A heatmap is a data visualization technique that represents data values as colors
✓ Stocks being correlated means that
there is a relationship between
their price movements, e.g. for
companies in the same industry:
GAFA stock correlations range
from 0.7 to 0.9!
✓ The color and intensity of the color
show the direction and the
strength of the correlation, e.g.
blue here means there is a
negative correlation.
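A minimal sketch of a correlation matrix and heatmap on synthetic "stock return" data; the four series and their dependencies are invented for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 250
base = rng.normal(0, 1, n)  # a common "industry" factor
returns = np.column_stack([
    base + rng.normal(0, 0.4, n),    # stock A
    base + rng.normal(0, 0.4, n),    # stock B, same industry as A
    -base + rng.normal(0, 0.4, n),   # stock C, moves against A and B
    rng.normal(0, 1, n),             # stock D, unrelated
])

corr = np.corrcoef(returns, rowvar=False)  # 4x4 correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(4))
ax.set_xticklabels(["A", "B", "C", "D"])
ax.set_yticks(range(4))
ax.set_yticklabels(["A", "B", "C", "D"])
fig.colorbar(im, ax=ax, label="PCC")
fig.savefig("corr_heatmap.png")
```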
Autocorrelation aka serial correlation
- Correlation between the time series and a lagged version of itself: useful to
find patterns in data and gives a first estimate of the extent to which a time
series is predictable
- It measures the linear dependence of a time series with itself at two points in
time, e.g. between times t and t+k (k is the “lag”)
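Lag-k autocorrelation is just the PCC between the series and a shifted copy of itself; a sketch on a synthetic series with a weekly (period-7) pattern:

```python
import numpy as np

def autocorr(x, k):
    """Lag-k autocorrelation: PCC between x[t] and x[t+k]."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-k], x[k:])[0, 1]

rng = np.random.default_rng(5)
t = np.arange(365)
# Synthetic daily series with a period-7 (weekly) pattern plus noise
series = np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, 365)

print(autocorr(series, 7))  # strongly positive: the pattern repeats every 7 steps
print(autocorr(series, 3))  # negative: lag 3 lands on the opposite phase
```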
Autocorrelation : visual representation
- A correlogram is a graphical representation of the autocorrelations of a
time series at different lags. It helps visualize how the values of the series
are related to each other across time.
Dashed lines represent confidence intervals (here at 95%). Autocorrelations
that fall outside these intervals are considered statistically significant:
“there is only a 5% chance that this correlation occurred by random chance”
Are the autocorrelations strong or weak? How many statistically significant
values do we have here?
Partial Autocorrelation
- Imagine we have a time series of daily temperatures and want to estimate the
correlations at lag 2
- The temperature on Day 𝑡 is strongly correlated – let’s say 0.8 - with the
temperature on Day 𝑡−1 because weather typically changes gradually
- The temperature on Day 𝑡 is also correlated with Day 𝑡−2, let’s say 0.6.
However, it could be partly because Day t−2 is strongly correlated with Day t−1.
- Explanation: Day 𝑡−2 influences Day 𝑡−1 and Day 𝑡−1 in turn influences Day 𝑡.
Therefore, the observed autocorrelation at lag 2 is a mixture of the direct
influence of Day 𝑡−2 on Day 𝑡 and the indirect influence through Day 𝑡−1.
Partial autocorrelation eliminates the effect of correlations at lower lags:
in the example above, the lag-1 correlation is removed
Partial Autocorrelation Function (PACF)
Nb: the autocorrelation at lag 0 will always be 1 (not shown here)
In the example, autocorrelation is strong for lags 3, 6, 9, …, while partial
autocorrelation is strong only at lag 3
In Python:
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(series, lags=50)
Outside our scope: the PACF helps find a model that fits the data, e.g., white
noise (no correlation), moving average, or “Autoregressive Models (AR)”, where
the current value of the time series can be fully explained by a linear
combination of its previous values.
Beyond the PCC : Rank-Based Correlation Measures
- Rank-based correlation coefficients – e.g., Spearman’s Rank Correlation and Kendall’s Tau -
measure the strength and direction of a relationship between two variables based on their
ranks, rather than their actual numerical values – unlike the PCC
- Example: determine if there is a relationship between the grade in one course and the grade in
another: “If student A has the highest score in both subjects and student B has the second-highest
in both, the ranks would match perfectly, leading to a high correlation”
- Ranking example: “the smaller the value, the higher its rank”
- the ranks of Dataset A: [13, 15, 11, 14, 12] are [3, 5, 1, 4, 2]
- the ranks of Dataset B: [30, 50, 10, 40, 20] are [3, 5, 1, 4, 2] (10 is the smallest → rank 1, then 20, 30, 40, 50)
- Some situations where they should be used:
- When the dataset contains outliers, such as unusual stock prices, or is highly skewed
(unbalanced, not evenly spread around the mean)
- Small sample size
- Non-linear relationships
- Data can be ranked but not numerically compared, such as performance of athletes in
different disciplines, grades expressed in letters, …
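A sketch of Spearman's rank correlation on the two datasets above, computing the ranks by hand (rank 1 = smallest value, assuming no ties) and then taking the PCC of the ranks:

```python
import numpy as np

def ranks(x):
    """Rank 1 = smallest value (assumes no ties)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

A = [13, 15, 11, 14, 12]
B = [30, 50, 10, 40, 20]

print(ranks(A))  # [3. 5. 1. 4. 2.]
print(ranks(B))  # [3. 5. 1. 4. 2.] -- 10 is smallest, then 20, 30, 40, 50

# Spearman's rho = Pearson correlation coefficient of the ranks
rho = np.corrcoef(ranks(A), ranks(B))[0, 1]
print(rho)  # 1.0: the two datasets order their items identically
```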