Correlation : introduction
- A frequent need in data analysis is to study the extent to which two quantities are related, e.g. whether they tend to change in the same direction
- Two types of relationships: linear and non-linear
© Nicolas Navet University of Luxembourg 3
Correlation : identifying it
- Visual exploration makes it possible to detect both linear and non-linear correlations
- There are statistical tests/measures of correlation:
- The Pearson Correlation Coefficient (PCC) is the standard measure of linear correlation, but it
does not capture non-linear relationships
- Spearman's rank correlation coefficient is an alternative to the PCC that is sensitive to non-linear
relationships, but it only captures how well the relationship between two variables can be
described by a monotonic function (so more general than a linear relationship, but not an
arbitrary relationship)
- The BDS (Brock-Dechert-Scheinkman) test is a statistical test designed to detect non-linear
dependence in time series data (out of scope) – “The BDS test enables us to reject the null
hypothesis that price changes are i.i.d.” (just a heads-up!)
- “Correlation does not imply causation”: two variables moving together does not
necessarily mean one causes the other. There may be hidden factors. Ex: the
rate of violent crime has been known to increase when ice cream sales do
Visual exploration of the relationship
between two variables
- Predictor: a predictor variable (aka independent variable) is used to predict the
value of another variable (the outcome or dependent variable). Is height a predictor
of weight?
- Let’s use the (synthetic!) data set at
https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset
Data visualization
Visual exploration of the correlation with a
scatterplot
- X is the value of the predictor, here the height
- Y is the value of the second variable (the outcome or dependent variable), the
weight
Is there an apparent correlation between height and weight?
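A minimal sketch of such a scatterplot in Python. The data below is synthetic (generated from a noisy linear relation), not the Kaggle data set:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders to file
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 500)                          # synthetic predictor
weight_kg = 0.55 * height_cm - 37.46 + rng.normal(0, 8, 500)  # synthetic outcome

fig, ax = plt.subplots()
ax.scatter(height_cm, weight_kg, s=10, alpha=0.5)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Weight vs Height")
fig.savefig("scatter_height_weight.png")
```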
Linear Regression: Weight vs Height
- A regression model predicts / explains a dependent variable based on
independent variables
- Weight (kg)=0.55×Height (cm)−37.46
- For each additional centimeter in height,
the model predicts an increase of
approximately 0.55 kg in weight
- Computed using the LinearRegression model from the scikit-learn library in
Python, using Ordinary Least Squares (OLS) → finds the best-fitting line by
minimizing the sum of squared differences between actual weight values and
predicted weight values
- OLS is sensitive to outliers and not always the best approach
Quality of regression depends on the difference between the observed and
predicted values → R-squared value, here 0.25: “25% of the variance in weight
can be explained by the height”
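The OLS fit described above can be sketched as follows. The data here is synthetic, generated from the slide's fitted line plus noise, so the recovered coefficients are only illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
height_cm = rng.normal(170, 10, 1000).reshape(-1, 1)  # sklearn expects a 2-D X
weight_kg = 0.55 * height_cm.ravel() - 37.46 + rng.normal(0, 5, 1000)

model = LinearRegression()  # fits by Ordinary Least Squares
model.fit(height_cm, weight_kg)

slope = model.coef_[0]      # kg gained per extra cm, close to 0.55 here
intercept = model.intercept_
r_squared = model.score(height_cm, weight_kg)  # share of variance explained
print(slope, intercept, r_squared)
```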
“Hexagonal binning” plot for large data sets
- Similar to a scatterplot, but the color intensity indicates the density of data points in each hexagon.
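matplotlib provides hexbin for exactly this; a minimal sketch on a large synthetic sample:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(170, 10, 100_000)                   # large synthetic sample
y = 0.55 * x - 37.46 + rng.normal(0, 8, 100_000)

fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=40, cmap="viridis")  # color = points per hexagon
fig.colorbar(hb, ax=ax, label="count")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
fig.savefig("hexbin_height_weight.png")
```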
Correlation : introduction
https://xkcd.com/552/
Why does the person on the
right answer “Maybe” ?
Covariance between two variables
- Covariance measures the direction of the linear
relationship between two variables:
- A positive covariance indicates that the variables
tend to move in the same direction, while a negative
covariance shows that they move in opposite
directions (i.e., as one variable increases, the other
tends to decrease)
- Covariance alone does not give a clear measure of
the strength of this relationship since its value
depends on the scales of the variables
- Pearson’s correlation coefficient removes the effects of
the scale of the variables, and thus makes it easier to
interpret the strength and direction of the relationship
Exercise: calculate the covariance of a={1,2,3,4,5} and b1={6,7,8,9,10},
then of a and b2={-6,7,-8,9,-10}. Hint: use np.cov() in Python
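A possible solution to the exercise with np.cov(). Note that np.cov returns a 2×2 covariance matrix; the covariance between the two inputs is the off-diagonal entry:

```python
import numpy as np

a  = np.array([1, 2, 3, 4, 5])
b1 = np.array([6, 7, 8, 9, 10])     # a shifted by 5: moves exactly with a
b2 = np.array([-6, 7, -8, 9, -10])  # alternating signs: no consistent direction

cov_ab1 = np.cov(a, b1)[0, 1]  # off-diagonal entry = Cov(a, b1)
cov_ab2 = np.cov(a, b2)[0, 1]

print(cov_ab1)  # 2.5  (positive: the variables move in the same direction)
print(cov_ab2)  # -1.5 (slightly negative)
```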
Covariance VS Variance
- Variance and its square root, the Standard Deviation (StdDev), measure how much a single
variable deviates from its mean. They provide a measure of the dispersion. The unit of
StdDev is the unit of the variable itself.
- Covariance measures how two variables move together. It shows the directional
relationship between the two variables. Covariance mixes two units and is
sensitive to the scales of the units.
- Nb: Cov(X,Y) = Cov(Y,X), therefore PCC(X,Y) = PCC(Y,X) (just a heads-up!)
Pearson Correlation Coefficient (PCC)
- Most popular measure of the
linear correlation between two variables
- r = Cov(X, Y) / (σ_X σ_Y)
- Cauchy-Schwarz inequality: |Cov(X, Y)| ≤ σ_X σ_Y
- The denominator normalizes the covariance
by the product of the standard deviations of
X and Y and ensures that the correlation
coefficient r ranges between −1 and 1.
- Direction of correlation:
- Positive: as one variable increases the other also increases. Example: the PCC
of the height and weight data set used earlier is 0.5, which indicates a
moderate positive correlation
- Negative: as one variable increases, the other decreases.
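The normalization described above can be checked numerically; a sketch on synthetic data comparing the manual formula with np.corrcoef():

```python
import numpy as np

rng = np.random.default_rng(7)
height = rng.normal(170, 10, 2000)                        # synthetic heights (cm)
weight = 0.55 * height - 37.46 + rng.normal(0, 9, 2000)   # noisy linear relation

# Manual formula: r = Cov(X, Y) / (sigma_X * sigma_Y)
r_manual = np.cov(height, weight)[0, 1] / (np.std(height, ddof=1) * np.std(weight, ddof=1))
# Library version
r_numpy = np.corrcoef(height, weight)[0, 1]

print(r_manual, r_numpy)  # identical values, roughly 0.5 for this noise level
```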
PCC : Strength of the correlation
Practice exercise
Use Python to calculate the Pearson correlation coefficient between variables
X and Y. Is there a significant correlation between age and glucose levels?
Hint: use np.corrcoef()
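A possible solution sketch; the age and glucose values below are hypothetical sample data, not the data set from the exercise:

```python
import numpy as np

# Hypothetical sample values -- replace with the exercise's actual data
age = np.array([43, 21, 25, 42, 57, 59])
glucose = np.array([99, 65, 79, 75, 87, 81])

r = np.corrcoef(age, glucose)[0, 1]
print(r)  # about 0.53 for these made-up values: a moderate positive correlation
```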
Pearson Correlation coefficient (PCC)
- PCC value is always between -1 (total negative correlation) and 1 (total
positive correlation). A value of 0 indicates no linear correlation.
Important: the PCC
reveals the existence and
direction of a linear
relationship but not the
slope of that relationship.
The latter can be derived
by fitting a linear model to
the data.
Figure from Wikipedia
Estimate the PCC
Figure from Wikipedia
Estimate the PCC
Figure from Wikipedia Relationships in the third row are not linear!
Limitations of the PCC
All 3 sets of data have the same PCC of 0.816
Figures from Wikipedia: a perfect linear relationship except for one outlier;
an obvious non-linear relationship
✓ The PCC is a summary statistic and cannot replace visual examination of the data.
✓ Prior to evaluating the PCC, we should look at a scatterplot to ensure that it
makes sense to proceed.
The PCC only tells the extent to which a relationship can be approximated by a
linear relationship
Correlation is not Causation
- Data shows that when ice cream sales
increase, drowning incidents also increase.
- This does not mean that ice cream
consumption causes more drowning
incidents!
- There is a hidden factor: hot weather
- Other techniques are needed to
demonstrate causation such as
- Randomized Controlled Trials: the
‘treatment group’ receives the treatment
while the ‘control group’ does not
- Controlled experiments: adjusting one
variable, e.g. temperature, while keeping
all others constant
Correlation Matrix and heatmap
- A table that shows the correlation coefficients between multiple variables
- A heatmap is a data visualization technique that represents data values as colors
✓ Stocks being correlated means that
there is a relationship between
their price movements, e.g. for
companies in the same industry:
GAFA stock correlations range
from 0.7 to 0.9!
✓ The color and intensity of the color
show the direction and the
strength of the correlation, e.g.
blue here means there is a
negative correlation.
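A minimal sketch of a correlation matrix and heatmap on synthetic "stock return" data; the four series and their dependencies are invented for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 250
base = rng.normal(0, 1, n)  # a common "industry" factor
returns = np.column_stack([
    base + rng.normal(0, 0.4, n),    # stock A
    base + rng.normal(0, 0.4, n),    # stock B, same industry as A
    -base + rng.normal(0, 0.4, n),   # stock C, moves against A and B
    rng.normal(0, 1, n),             # stock D, unrelated
])

corr = np.corrcoef(returns, rowvar=False)  # 4x4 correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(4))
ax.set_xticklabels(["A", "B", "C", "D"])
ax.set_yticks(range(4))
ax.set_yticklabels(["A", "B", "C", "D"])
fig.colorbar(im, ax=ax, label="PCC")
fig.savefig("corr_heatmap.png")
```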
Autocorrelation aka serial correlation
- Correlation between the time series and a lagged version of itself: useful to
find patterns in data and gives a first estimate of the extent to which a time
series is predictable
- It measures the linear dependence of a time series with itself at two points in
time, e.g. between times t and t+k (k is the “lag”)
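Lag-k autocorrelation is just the PCC between the series and a shifted copy of itself; a sketch on a synthetic series with a weekly (period-7) pattern:

```python
import numpy as np

def autocorr(x, k):
    """Lag-k autocorrelation: PCC between x[t] and x[t+k]."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-k], x[k:])[0, 1]

rng = np.random.default_rng(5)
t = np.arange(365)
# Synthetic daily series with a period-7 (weekly) pattern plus noise
series = np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, 365)

print(autocorr(series, 7))  # strongly positive: the pattern repeats every 7 steps
print(autocorr(series, 3))  # negative: lag 3 lands on the opposite phase
```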
Autocorrelation : visual representation
- A correlogram is a graphical representation of the autocorrelations of a
time series at different lags. It helps visualize how the values of the series
are related to each other across time.
Dashed lines represent confidence intervals (here at 95%). Autocorrelations
that fall outside these intervals are considered statistically significant:
“there is only a 5% chance that this correlation occurred by random chance”
Are the autocorrelations strong or weak? How many statistically significant
values do we have here?
Partial Autocorrelation
- Imagine we have a time series of daily temperatures and want to estimate the
correlations at lag 2
- The temperature on Day 𝑡 is strongly correlated – let’s say 0.8 - with the
temperature on Day 𝑡−1 because weather typically changes gradually
- The temperature on Day 𝑡 is also correlated with Day 𝑡−2, let’s say 0.6.
However, it could be partly because Day t−2 is strongly correlated with Day t−1.
- Explanation: Day 𝑡−2 influences Day 𝑡−1 and Day 𝑡−1 in turn influences Day 𝑡.
Therefore, the observed autocorrelation at lag 2 is a mixture of the direct
influence of Day 𝑡−2 on Day 𝑡 and the indirect influence through Day 𝑡−1.
Partial autocorrelation eliminates the effect of correlations at lower lags:
in the example above, the lag-1 correlation is removed
Partial Autocorrelation Function (PACF)
Nb: the autocorrelation at lag 0 will always be 1 (not shown here)
In the example, autocorrelation is strong for lags 3, 6, 9, …, while partial
autocorrelation is strong only at lag 3
In Python:
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(series, lags=50)
Outside our scope: the PACF helps find a model that fits the data, e.g., white
noise (no correlation), moving average, or “Autoregressive Models (AR)”, where
the current value of the time series can be fully explained by a linear
combination of its previous values.
Beyond the PCC : Rank-Based Correlation Measures
- Rank-based correlation coefficients – e.g., Spearman’s Rank Correlation and Kendall’s Tau -
measure the strength and direction of a relationship between two variables based on their
ranks, rather than their actual numerical values – unlike the PCC
- Example: determine if there is a relationship between the grade in one course and the grade in
another: “If student A has the highest score in both subjects and student B has the second-highest
in both, the ranks would match perfectly, leading to a high correlation”
- Ranking example: “the smaller the value, the higher its rank”
- the ranks of Dataset A: [13, 15, 11, 14, 12] are [3, 5, 1, 4, 2]
- the ranks of Dataset B: [30, 50, 10, 40, 20] are [3, 5, 1, 4, 2] (10 is the smallest → rank 1, then 20, 30, 40, 50)
- Some situations where they should be used:
- When the dataset contains outliers, such as unusual stock prices, or is highly skewed
(unbalanced, not evenly spread around the mean)
- Small sample size
- Non-linear relationships
- Data can be ranked but not numerically compared, such as performance of athletes in
different disciplines, grades expressed in letters, …
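A sketch of Spearman's rank correlation on the two datasets above, computing the ranks by hand (rank 1 = smallest value, assuming no ties) and then taking the PCC of the ranks:

```python
import numpy as np

def ranks(x):
    """Rank 1 = smallest value (assumes no ties)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

A = [13, 15, 11, 14, 12]
B = [30, 50, 10, 40, 20]

print(ranks(A))  # [3. 5. 1. 4. 2.]
print(ranks(B))  # [3. 5. 1. 4. 2.] -- 10 is smallest, then 20, 30, 40, 50

# Spearman's rho = Pearson correlation coefficient of the ranks
rho = np.corrcoef(ranks(A), ranks(B))[0, 1]
print(rho)  # 1.0: the two datasets order their items identically
```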