University of Padua
Statistics for Management
Simple Linear Regression (1)
Omar Paccagnella
Department of Statistical Sciences
University of Padua
omar.paccagnella@unipd.it
http://www.stat.unipd.it/~paccagnella
Statistics for Management, a.y. 2018/19 - Simple linear regression (1)
Introduction
What happens if:
• Data are time-oriented
• There is more than 1 variable
Unit   Shop surface (m²)   Weekly sales (1000 €)
 1      95                  43.2
 2     144                 132.0
 3     210                 155.0
 4     156                  76.0
 5     188                 100.9
 6     321                 187.4
 7     250                 185.0
 8     115                  60.7
 9     178                  82.9
10     105                  61.3
• A scatter diagram or scatter plot (that is, a two-dimensional graph)
may help to show a relationship between two variables
• Is this relationship linear? Is it positive or negative? If linear, could
we summarise such a relationship by fitting a straight line through the
data points (like a trend for a time series)?
The Correlation Coefficient
It measures the extent to which two variables (usually called X and Y) are
linearly related to each other
(in other words, the strength of such a linear relationship)
• In the population (which contains all possible values of the pair (X, Y)
of interest): ρ
• In the (random) sample drawn from this population: r
Often the two variables are measured in different units (in the example,
square metres & €); nevertheless it is important to measure the extent to
which X and Y are related
• Standardize the variables (constructing the Z-scores):

Z_X = (X − X̄) / S_X        Z_Y = (Y − Ȳ) / S_Y

where
– X̄: average value (mean) of X
– S_X: standard deviation of X
– n: number of units
• Calculate the mean cross product of the Z-scores:

r = (1/(n−1)) Σ Z_X Z_Y = Σ(X − X̄)(Y − Ȳ) / [√Σ(X − X̄)² · √Σ(Y − Ȳ)²]

– Correlation, not causation, is measured
– −1 ≤ r ≤ 1 (check the value of r = 0.8927 in the previous example)
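As a check, r for the shop-surface/sales data in the introductory table can be computed exactly as above, via the mean cross product of the Z-scores (a minimal Python sketch; the variable names are illustrative):

```python
from math import sqrt

# Shop surface (m²) and weekly sales (1000 €) from the introductory table
x = [95, 144, 210, 156, 188, 321, 250, 115, 178, 105]
y = [43.2, 132.0, 155.0, 76.0, 100.9, 187.4, 185.0, 60.7, 82.9, 61.3]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sample standard deviations (n − 1 in the denominator)
sx = sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sy = sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))

# Z-scores and their mean cross product
zx = [(xi - mx) / sx for xi in x]
zy = [(yi - my) / sy for yi in y]
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

print(round(r, 4))  # -> 0.8927, matching the value quoted above
```

The same value comes out of the second (deviation-product) form of the formula, since the two expressions are algebraically identical.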
Fitting a Straight Line
• Could we find a straight line that is able to summarise the pattern of
all X − Y data points?
• Could we fit the best straight line?
• Could we exploit this best straight line to forecast unknown (future?)
values of the variable of interest (Y )?
• We may introduce a mathematical procedure to calculate both the
Y -intercept and the slope of the best-fitting straight line.
• Since many straight lines can be calculated, the most common
approach to determine the best fit is the method of least squares
(OLS - Ordinary Least Squares)
The best fitting line is the one that minimises the sum of
the squared distances between the data points and the line itself,
as measured in the vertical (Y ) direction
• In the population the straight line may be mathematically defined as:

Y = β0 + β1 X

• In the sample the straight line may be mathematically defined as:

Y = b0 + b1 X

where b0 and b1 are estimates of the true (but unknown) population
intercept and slope.
Given the sample values, we can predict the Y values on the fitted line:

Ŷ = b̂0 + b̂1 X

Ŷ is the value of Y we would observe if the data points lay exactly on the line
Least Squares
The idea behind this method is that the line will be appropriate to
describe the relationship under investigation if the observed values are
close to the straight line.
The distance between the observed and fitted values is the residual:

e_i = Y_i − Ŷ_i = Y_i − b0 − b1 X_i

According to the OLS criterion, the values of b0 and b1 are chosen in
order to minimise the sum of squared errors (residuals):

SSE = f(b0, b1) = Σ_{i=1}^n e_i² = Σ_{i=1}^n (Y_i − b0 − b1 X_i)²
First order conditions (that is, the derivatives of f(b0, b1) with respect to b0
and b1) are applied to minimise SSE. After a little algebra:

b̂1 = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)²

b̂0 = Ȳ − b̂1 X̄
The least squares slope is related to the sample correlation coefficient:

b̂1 = [√Σ_{i=1}^n (Y_i − Ȳ)² / √Σ_{i=1}^n (X_i − X̄)²] · r

Hence, b̂1 and r are proportional to one another and have the same sign.
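The OLS formulas, and the proportionality between b̂1 and r, can be verified on the shop data from the introduction (a Python sketch; the rounded estimates printed here are a numerical check of the formulas, not values quoted in the slides):

```python
from math import sqrt

# Shop data from the introductory table
x = [95, 144, 210, 156, 188, 321, 250, 115, 178, 105]
y = [43.2, 132.0, 155.0, 76.0, 100.9, 187.4, 185.0, 60.7, 82.9, 61.3]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)            # Σ(X − X̄)²
syy = sum((yi - my) ** 2 for yi in y)            # Σ(Y − Ȳ)²
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # Σ(X − X̄)(Y − Ȳ)

b1 = sxy / sxx          # least squares slope
b0 = my - b1 * mx       # intercept: Ȳ − b̂1 X̄

# Sample correlation, and the identity b̂1 = √Σ(Y − Ȳ)² / √Σ(X − X̄)² · r
r = sxy / (sqrt(sxx) * sqrt(syy))
assert abs(b1 - sqrt(syy) / sqrt(sxx) * r) < 1e-12

print(round(b0, 2), round(b1, 4))  # -> -10.19 0.6733
```

Since r > 0 here, the slope is positive too, as the identity requires.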
The linear regression model
According to the least squares criterion, we have the identity
Observation = Fit + Residual
formally
Y = Ŷ + (Y − Ŷ)
• The fit represents the overall pattern in the data
• The residuals represent deviations from the pattern
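This decomposition can be checked numerically on the shop data (a sketch; a standard property of OLS, used here, is that the residuals sum to zero whenever an intercept is included):

```python
# Shop data from the introductory table
x = [95, 144, 210, 156, 188, 321, 250, 115, 178, 105]
y = [43.2, 132.0, 155.0, 76.0, 100.9, 187.4, 185.0, 60.7, 82.9, 61.3]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
     sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

fit = [b0 + b1 * xi for xi in x]                # Ŷ: the overall pattern
res = [yi - fi for yi, fi in zip(y, fit)]       # e = Y − Ŷ: deviations from it

# Observation = Fit + Residual holds exactly for every unit
assert all(abs(yi - (fi + ei)) < 1e-9 for yi, fi, ei in zip(y, fit, res))
# and the OLS residuals sum to zero (up to floating-point rounding)
assert abs(sum(res)) < 1e-6
```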
Observed data is a sample of observations on an underlying relation that
holds in the population.
For each value of X, the observed values of Y are identically distributed
around a mean µ_Y that depends linearly on X:

µ_Y = β0 + β1 X
As X changes, the means of the distributions of the possible values of Y
lie along a straight line. This is the so-called
population regression line
• Observed values of Y vary because of the presence of unknown
(and unmeasured) factors.
• This variation is the same for all values of X and is measured by the
standard deviation σ.
• The distance between a Y value and its mean is called the error (ε).
In the simple linear regression model:
• Y is the response or dependent variable.
• X is the controlled or explanatory (independent) variable.
• The dependent variable is the sum of its mean and a random
deviation (ε) from this mean.
• Deviations represent variation in Y due to unobserved factors that
prevent the pairs (X, Y) from lying exactly on the straight line.
The population regression model may be defined as:

Y = β0 + β1 X + ε
The sample regression line may be regarded as an estimate of the
population regression line,
µ_Y = β0 + β1 X
and the residuals e = Y − Ŷ may be regarded as estimates of the error
components
Therefore:
Y = b0 + b1 X + e
Some notes
• We may also write

b1 = Cov(X, Y) / Var(X)

if Var(X) ≠ 0
• b1 = 0 if and only if Cov(X, Y) = 0, that is, the two variables are
linearly independent
• Cov(X, Y) provides the sign of the b1 estimate
• The regression line always passes through the point of means (X̄, Ȳ)
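Both notes can be checked on the shop data (a sketch; the covariance form of the slope and the pass-through-the-means property follow directly from the OLS formulas):

```python
# Shop data from the introductory table
x = [95, 144, 210, 156, 188, 321, 250, 115, 178, 105]
y = [43.2, 132.0, 155.0, 76.0, 100.9, 187.4, 185.0, 60.7, 82.9, 61.3]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sample covariance and variance (n − 1 in the denominator; it cancels in the ratio)
cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - mx) ** 2 for xi in x) / (n - 1)

b1 = cov_xy / var_x     # slope as Cov(X, Y) / Var(X)
b0 = my - b1 * mx

# The fitted line always passes through the point of means (X̄, Ȳ)
assert abs((b0 + b1 * mx) - my) < 1e-9
```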
Steps of a Linear Regression Analysis
• Hypothesis on the linear functional relationship between the variable
of interest and the other variable(s)
• Estimation of the parameters of this functional relationship,
based on the available sample data
• Statistical testing of model estimates and goodness of fit
• Robustness checks on the main assumptions of the linear
regression model