Lecture 2

Correlation : introduction

- A frequent need in data analysis is to study the extent to which two quantities are
related, e.g. whether they tend to change in the same direction
- Two types of relationships: linear and non-linear

© Nicolas Navet University of Luxembourg 3


Correlation : identifying it
- Visual exploration makes it possible to detect both linear and non-linear correlations
- There are statistical tests/measures of correlation:
- The Pearson Correlation Coefficient (PCC) is the standard measure of linear correlation, but it
does not capture non-linear relationships
- Spearman's rank correlation coefficient is an alternative to the PCC that is sensitive to non-linear
relationships, but it only captures how well the relationship between two variables can be
described using a monotonic function (so more general than a linear relationship, but not
any relationship)
- The BDS (Brock-Dechert-Scheinkman) test is a statistical test designed to detect non-linear
dependencies in time series data (out of scope) – “The BDS test enables us to reject the null
hypothesis that price changes are i.i.d.” (just a heads-up!)
- “Correlation does not imply causation”: two variables moving together does not
necessarily mean one causes the other. There may be hidden factors. Ex: rates of
violent crime have been known to increase when ice cream sales do



Visual exploration of the relationship
between two variables
- Predictor: a predictor variable (aka independent variable) is used to predict the
value of another variable (the outcome or dependent variable). Is height a predictor
of weight?
- Let’s use the (synthetic!) data set at
https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset




Data visualization



Visual exploration of the correlation with a
scatterplot
- X is the value of the predictor, here the height

- Y is the value of the second variable (the outcome or dependent variable), the
weight

Is there an apparent
correlation between
height and weight?



Linear Regression: Weight vs Height
- A regression model predicts / explains a dependent variable based on independent
variables
- Weight (kg) = 0.55 × Height (cm) − 37.46
- For each additional centimeter in height, the model predicts an increase of
approximately 0.55 kg in weight
- Computed using the LinearRegression model from the scikit-learn library in Python,
using Ordinary Least Squares (OLS) → finds the best-fitting line by minimizing the
sum of squared differences between actual and predicted weight values
- OLS is sensitive to outliers and not always the best approach
- The quality of the regression depends on the difference between the observed and
predicted values → R-squared value, here 0.25: “25% of the variance in weight can
be explained by the height”
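The slide's coefficients came from scikit-learn's LinearRegression; as a minimal numpy-only sketch of the same OLS idea (on hypothetical toy data, not the Kaggle data set), fitting the line and computing R-squared could look like this:

```python
import numpy as np

# Hypothetical height/weight sample (cm, kg) -- NOT the Kaggle data set
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
weight = np.array([55.0, 61.0, 58.0, 70.0, 72.0, 74.0])

# OLS fit of the line: weight = slope * height + intercept
slope, intercept = np.polyfit(height, weight, deg=1)

# R-squared: share of the variance in weight explained by height
pred = slope * height + intercept
ss_res = np.sum((weight - pred) ** 2)
ss_tot = np.sum((weight - weight.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

On a real data set one would typically use scikit-learn or statsmodels instead, which also report diagnostics such as confidence intervals on the slope.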



“Hexagonal binning” plot for large data sets
- Similar to a scatterplot, but
- the color intensity indicates the density of data points in each hexagon.



Correlation : introduction

https://xkcd.com/552/

Why does the person on the right answer “Maybe”?



Covariance between two variables
- Covariance measures the direction of the linear relationship between two variables:
- A positive covariance indicates that the variables tend to move in the same
direction, while a negative covariance shows that they move in opposite directions
(i.e., as one variable increases, the other tends to decrease)
- Covariance alone does not give a clear measure of the strength of this relationship
since its value depends on the scales of the variables
- Pearson’s correlation coefficient removes the effects of the scale of the variables,
and thus makes it easier to interpret the strength and direction of the relationship

Exercise: calculate the covariance of a = {1,2,3,4,5} and b1 = {6,7,8,9,10}, then of
a and b2 = {-6,7,-8,9,-10}. Hint: use np.cov() in Python
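A sketch of the exercise above using np.cov(), which returns the 2×2 covariance matrix (the covariance of the two inputs is the off-diagonal entry):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b1 = np.array([6, 7, 8, 9, 10])
b2 = np.array([-6, 7, -8, 9, -10])

# np.cov returns the covariance matrix; entry [0, 1] is Cov(a, b)
cov_ab1 = np.cov(a, b1)[0, 1]   # positive: a and b1 move together
cov_ab2 = np.cov(a, b2)[0, 1]   # weaker and negative: no clear co-movement
```

Note that np.cov uses the sample estimator (division by n−1) by default.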



Covariance vs Variance
- Variance, and its square root the Standard Deviation (StdDev), measure how much a
single variable deviates from its mean. They provide a measure of the dispersion. The
unit of StdDev is the unit of the variable itself.
- Covariance measures how two variables move together. It shows the directional
relationship between the two variables. Covariance mixes two units and is
sensitive to the scales of the units.
- Nb: Cov(X,Y) = Cov(Y,X), therefore PCC(X,Y) = PCC(Y,X) (just a heads-up!)
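The scale sensitivity and the symmetry can be checked numerically: rescaling one variable (say, metres to centimetres) multiplies the covariance but leaves the PCC unchanged. A small numpy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(scale=0.5, size=100)   # y linearly related to x

# Rescaling y by 100 (e.g. metres -> centimetres) scales the covariance...
cov_before = np.cov(x, y)[0, 1]
cov_after = np.cov(x, 100 * y)[0, 1]

# ...but leaves the Pearson correlation coefficient unchanged
r_before = np.corrcoef(x, y)[0, 1]
r_after = np.corrcoef(x, 100 * y)[0, 1]

# Symmetry: Cov(X, Y) = Cov(Y, X)
cov_xy = np.cov(x, y)[0, 1]
cov_yx = np.cov(y, x)[0, 1]
```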



Pearson Correlation Coefficient (PCC)
- Most popular measure of the linear correlation between two variables:
r = Cov(X, Y) / (σ_X · σ_Y)
- The denominator normalizes the covariance by the product of the standard
deviations of X and Y; by the Cauchy-Schwarz inequality, |Cov(X, Y)| ≤ σ_X · σ_Y,
which ensures that the correlation coefficient r ranges between −1 and 1.

- Direction of correlation:
- Positive: as one variable increases, the other also increases. Example: the PCC
of the height and weight data set used earlier is 0.5, which indicates a
moderate positive correlation
- Negative: as one variable increases, the other decreases.
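The normalization can be verified numerically: dividing the covariance by the product of the standard deviations reproduces what np.corrcoef computes (a sketch on arbitrary toy values):

```python
import numpy as np

# Arbitrary toy data (not from the lecture's data set)
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 8.0])

# r = Cov(X, Y) / (sigma_X * sigma_Y), using sample (ddof=1) estimates
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef computes the same normalized quantity
r_numpy = np.corrcoef(x, y)[0, 1]
```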



PCC : Strength of the correlation



Practice exercise
Use Python to calculate the Pearson correlation coefficient between variables
X and Y. Is there a significant correlation between age and glucose levels?

Hint: use np.corrcoef()
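The exercise's data table is not reproduced here; a sketch with hypothetical age/glucose values shows the mechanics (substitute the values from the exercise's table):

```python
import numpy as np

# Hypothetical sample -- NOT the exercise's actual table
age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r
r = np.corrcoef(age, glucose)[0, 1]
```

A value around 0.5 would, per the strength table on the previous slide, indicate a moderate positive correlation.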





Pearson Correlation Coefficient (PCC)
- PCC value is always between -1 (total negative correlation) and 1 (total
positive correlation). A value of 0 indicates no linear correlation.

Important: the PCC reveals the existence and direction of a linear relationship but
not the slope of that relationship. The latter can be derived by fitting a linear
model to the data.

Figure from Wikipedia



Estimate the PCC

Figure from Wikipedia



Estimate the PCC

Figure from Wikipedia. The relationships in the third row are not linear!



Limitations of the PCC
All 3 sets of data have the same PCC of 0.816

Panel captions: a perfect linear relationship except for one outlier; an obvious
non-linear relationship (figures from Wikipedia).

✓ The PCC is a summary statistic and cannot replace visual examination of the data.
✓ Prior to evaluating the PCC, we should look at a scatterplot to ensure that it
makes sense to proceed.
✓ The PCC only tells the extent to which a relationship can be approximated by a
linear relationship.



Correlation is not Causation
- Data shows that when ice cream sales
increase, drowning incidents also increase.
- This does not mean that ice cream
consumption causes more drowning
incidents!
- There is a hidden factor: hot weather
- Other techniques are needed to
demonstrate causation such as
- Randomized Controlled Trials: the ‘treatment group’ receives the treatment
while the ‘control group’ does not
- Controlled experiments: adjusting one variable, e.g. temperature, while
keeping all others constant



Correlation Matrix and heatmap
- A table that shows the correlation coefficients between multiple variables
- A heatmap is a data visualization technique that represents data values as colors

✓ Saying that stocks are correlated means that there is a relationship between
their price movements, e.g. for companies in the same industry: GAFA stock
correlations range from 0.7 to 0.9!
✓ The color and intensity of the color show the direction and the strength of
the correlation, e.g. blue here means there is a negative correlation.
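A correlation matrix can be sketched with np.corrcoef on stacked series; the data below are hypothetical "returns", and in practice one would render the matrix with e.g. seaborn's heatmap:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical daily returns: two stocks driven by a common market factor,
# plus one unrelated stock
market = rng.normal(size=250)
stock_a = market + 0.3 * rng.normal(size=250)
stock_b = market + 0.3 * rng.normal(size=250)
stock_c = rng.normal(size=250)              # unrelated to the market

returns = np.vstack([stock_a, stock_b, stock_c])
corr = np.corrcoef(returns)   # 3x3 correlation matrix
```

The matrix is symmetric with ones on the diagonal; stock_a and stock_b come out strongly correlated (same "industry" factor), while stock_c does not.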



Autocorrelation (aka serial correlation)
- Correlation between a time series and a lagged version of itself: useful to
find patterns in data, and gives a first estimate of the extent to which a time
series is predictable
- Calculated as the linear dependence of the time series with itself at two
points in time, e.g. between times t and t+k (k is the “lag”)
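The lag-k autocorrelation is simply the PCC between the series and its shifted copy; a numpy-only sketch on a hypothetical seasonal series:

```python
import numpy as np

def autocorr(series, k):
    """Lag-k autocorrelation: PCC between x[t] and x[t+k]."""
    series = np.asarray(series, dtype=float)
    return np.corrcoef(series[:-k], series[k:])[0, 1]

# A noisy sine wave that repeats every 12 steps: the lag-12 autocorrelation
# is high, and the lag-6 (half-period) autocorrelation is strongly negative
rng = np.random.default_rng(2)
t = np.arange(240)
x = np.sin(2 * np.pi * t / 12) + 0.2 * rng.normal(size=240)
```

Libraries such as statsmodels (acf, plot_acf) compute this for all lags at once.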



Autocorrelation : visual representation
- A correlogram is a graphical representation of the autocorrelations of a
time series at different lags. It helps visualize how the values of the series
are related to each other across time.

Dashed lines represent confidence intervals (here at 95%). Autocorrelations that
fall outside these intervals are considered statistically significant: “there is
only a 5% chance that this correlation occurred by random chance.”

Are the autocorrelations strong or weak? How many statistically significant
values do we have here?
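Under a white-noise null hypothesis, the dashed 95% bounds sit at approximately ±1.96/√n. A sketch that counts "significant" lags on a hypothetical white-noise series (where roughly 5% of lags should exceed the bounds purely by chance):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=400)          # hypothetical white-noise series
n = len(x)
bound = 1.96 / np.sqrt(n)         # approximate 95% confidence bound

# Autocorrelation at each lag, counted against the bound
acfs = [np.corrcoef(x[:-k], x[k:])[0, 1] for k in range(1, 51)]
n_significant = sum(abs(r) > bound for r in acfs)
```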



Partial Autocorrelation
- Imagine we have a time series of daily temperatures and want to estimate the
correlations at lag 2
- The temperature on Day 𝑡 is strongly correlated – let’s say 0.8 - with the
temperature on Day 𝑡−1 because weather typically changes gradually
- The temperature on Day 𝑡 is also correlated with Day 𝑡−2, let’s say 0.6.
However, it could be partly because Day t−2 is strongly correlated with Day t−1.
- Explanation: Day 𝑡−2 influences Day 𝑡−1 and Day 𝑡−1 in turn influences Day 𝑡.
Therefore, the observed autocorrelation at lag 2 is a mixture of the direct
influence of Day 𝑡−2 on Day 𝑡 and the indirect influence through Day 𝑡−1.

Partial autocorrelation eliminates the effect of correlations at lower lags: in
the example above, the lag-1 correlation is removed.
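This idea can be sketched with numpy alone: regress x[t] and x[t−2] each on x[t−1], then correlate the residuals; what remains is the lag-2 partial autocorrelation (an illustrative implementation, not statsmodels' exact algorithm):

```python
import numpy as np

def pacf_lag2(x):
    """Lag-2 partial autocorrelation: correlation of x[t] and x[t-2]
    after removing the linear influence of x[t-1] from both."""
    x = np.asarray(x, dtype=float)
    x_t, x_t1, x_t2 = x[2:], x[1:-1], x[:-2]
    # Regress out x[t-1] from both sides, keep the residuals
    res_t = x_t - np.polyval(np.polyfit(x_t1, x_t, 1), x_t1)
    res_t2 = x_t2 - np.polyval(np.polyfit(x_t1, x_t2, 1), x_t1)
    return np.corrcoef(res_t, res_t2)[0, 1]

# AR(1) process: each value depends directly only on the previous one, so
# the lag-2 autocorrelation is high but the lag-2 PACF is near zero
rng = np.random.default_rng(4)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.8 * x[t - 1] + rng.normal()

lag2_acf = np.corrcoef(x[:-2], x[2:])[0, 1]
lag2_pacf = pacf_lag2(x)
```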



Partial Autocorrelation Function (PACF)
Nb: the autocorrelation at lag 0 will always be 1 (not shown here).

- Autocorrelation is strong for lags 3, 6, 9, …
- Partial autocorrelation is strong only at lag 3

In Python:
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(series, lags=50)

Outside our scope: the PACF allows one to find a model that fits the data, e.g.
white noise (no correlation), moving average, or “Autoregressive (AR) models”:
the current value of the time series can be fully explained by a linear
combination of its previous values.



Beyond the PCC : Rank-Based Correlation Measures
- Rank-based correlation coefficients – e.g., Spearman’s Rank Correlation and Kendall’s Tau -
measure the strength and direction of a relationship between two variables based on their
ranks, rather than their actual numerical values – unlike the PCC
- Example: determine if there is a relationship between the grade in one course and the grade in
another: “If student A has the highest score in both subjects and student B has the second-highest
in both, the ranks would match perfectly, leading to a high correlation”
- Ranking example: “the smaller the value, the higher its rank”
- the ranks of Dataset A: [13, 15, 11, 14, 12] are [3, 5, 1, 4, 2]
- the ranks of Dataset B: [30, 50, 10, 40, 20] are [3, 5, 1, 4, 2]
- Some situations where they should be used:
- When the dataset contains outliers, such as unusual stock prices, or is highly skewed
(unbalanced, not evenly spread around the mean)
- Small sample size
- Non-linear relationships
- Data can be ranked but not numerically compared, such as performance of athletes in
different disciplines, grades expressed in letters, …
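Spearman's coefficient is simply the PCC applied to the ranks; a numpy-only sketch using Datasets A and B from the ranking example above (scipy.stats.spearmanr would do the same in one call, and also handles ties):

```python
import numpy as np

def ranks(values):
    """Rank values so the smallest gets rank 1 (ties not handled here)."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

a = np.array([13, 15, 11, 14, 12])
b = np.array([30, 50, 10, 40, 20])

# Spearman's rho = Pearson correlation of the ranks
rho = np.corrcoef(ranks(a), ranks(b))[0, 1]
```

Here the two rank vectors are identical, so rho is 1: a perfect monotonic relationship, even though the raw values of A and B live on very different scales.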

