BUS9040 – DECISION ANALYSIS FOR MANAGERS
SEMINAR 3
Predictive Analytics
MULTIPLE REGRESSION
Types of Data Analytics
• Descriptive – “What happened?” – looks at data to examine,
understand, and describe something that has already happened.
• Diagnostic – “Why did this happen?” – goes deeper than
descriptive analytics by seeking to understand the “why” behind
what happened.
• Predictive – “What might happen in the future?” – relies on
historical data, past trends, and assumptions to answer questions
about what will happen in the future.
• Prescriptive – “What should we do next?” – identifies specific
actions an individual or organization should take to reach future
targets or goals.
Introduction
Regression looks at the relationships between quantitative
(continuous) variables. These might be:
• Advertising expenditure and sales revenue
• Tourist arrival numbers and oil prices
• Electricity consumption and temperature
• Auction price and number of bidders, etc.
We can examine the relationship between two variables
through a scatter diagram (plot one against the other).
In this seminar, we will look at how to examine such
relationships more formally.
Example: Price of wine
The price of a bottle of wine is thought to depend on many
factors, such as its age, the quality of the grapes used to produce
it, the amount of rainfall during the growing season, where the
wine was produced, etc.
The table below shows the price of 10 randomly selected bottles of
wine from an online wine merchant. Also shown is the age of each
wine selected.
Table 1: Price and age of wine
Bottle          1     2      3     4     5     6      7     8     9      10
Age (X, years)  3.5   5      3     2.5   3     2      2.5   1     10     4
Price (Y, £)    4.50  12.95  6.50  4.99  7.50  14.95  8.25  3.95  18.99  10.00
Is there any relationship between age and price of wine? If so, can you
DESCRIBE this relationship?
Dr. Lee Fawcett
Figure 1: Scatter plot of price and age of wine
Quantifying the relationship: Correlation
Scatterplots such as Figure 1 can be difficult to interpret using
words alone, since different people might say different things.
Some might think there is a moderate/fairly strong relationship
between X and Y here, whilst others might conclude that there is a
relatively weak relationship between these two variables.
Interpreting such relationships with words alone can be subjective;
quantifying such relationships numerically can circumvent this
problem of subjectivity.
The correlation coefficient r always lies between -1 and +1.
If r is close to +1, there is a strong positive linear relationship.
If r is close to -1, there is a strong negative linear relationship.
If r is close to zero, there is no linear relationship between the
variables.
Note that r ≈ 0 does not imply no relationship at all, simply no
linear relationship.
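
As an illustration, the correlation coefficient for the wine data can be computed directly from its definition. The sketch below uses only the Python standard library; the function name pearson_r is an illustrative choice, and the ages are taken as decimals (3.5, 2.5, and so on).

```python
from math import sqrt

# Wine data from the seminar example (age in years, price in pounds).
age = [3.5, 5, 3, 2.5, 3, 2, 2.5, 1, 10, 4]
price = [4.50, 12.95, 6.50, 4.99, 7.50, 14.95, 8.25, 3.95, 18.99, 10.00]


def pearson_r(x, y):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


r = pearson_r(age, price)
print(f"r = {r:.3f}")
```

For these ten bottles r comes out at roughly 0.73, i.e. a fairly strong positive linear relationship between age and price.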
• Figure 2 depicts four different relationships between X and Y, each
quantified using the correlation coefficient (r).
• Note that r is almost zero in the bottom-right panel, yet there is apparently a
strong non-linear relationship. Again, r ≈ 0 does not imply no
relationship, simply no linear relationship.
Figure 2: Scatter plots of two variables X and Y (panels: r = 1, r = −0.899, r = 0.699, r = 0.064)
Simple Linear Regression
A correlation analysis helps to establish whether or not there is a
linear relationship between two variables. However, it doesn’t allow
us to use this linear relationship.
Regression analysis allows us to use the linear relationship between
variables. For example, with a regression analysis, we can predict
the value of one variable given the value of another.
To perform a regression analysis, we must assume that
• the scatter plot of the two variables (roughly) shows a straight line, and
• the spread in the Y-direction is roughly constant with X.
Recall Figure 1: Scatter plot of price and age of wine
The Regression Equation
The simple (univariate) linear regression model is given by
Y = β0 + β1 X + ε
where
Y is the response variable (also called dependent variable) and
X is the explanatory variable (also called independent variable).
β0 represents the intercept of the regression line (the point where
the line “cuts” the Y –axis),
β1 represents the slope of the regression line (i.e. how steep the
line is), and
ε is known as “random error”
In practice, we assume ε is zero on average, and so the only things we need to
estimate are β0 and β1. But how?
Note: Instead of β0 and β1, the regression equation may also be written as
Y = α + βX + ε
with α as the intercept and β as the slope.
• For the wine price data, we can find the values of α and β,
and hence the line of best fit.
• In practice, we use a computer (e.g. Excel) to calculate α and β
and give us the regression equation (next slide).
• Recall that the regression equation has the form Y = β0 + β1 X,
or equivalently Y = α + βX.
• Using Excel (see Excel output below), we obtain the regression
equation for the wine data as: Y = 3.905 + 1.467X
• The plot in Figure 3 (next slide) shows the scatter diagram for the
wine data again, but now with the regression line superimposed. We
can use the regression line (or equation) to make predictions of the
dependent variable (price of wine in this case)
Figure 3: Scatter plot of price and age of wine (with regression line)
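
For readers who want to see what the software is doing, the line of best fit can be reproduced with the usual least-squares formulas: the slope is the sum of cross-products divided by the sum of squares of X, and the intercept makes the line pass through the point of means. This is a minimal sketch in plain Python, not the Excel procedure itself; the function name fit_line is an illustrative choice.

```python
# Wine data from the seminar example (age in years, price in pounds).
age = [3.5, 5, 3, 2.5, 3, 2, 2.5, 1, 10, 4]
price = [4.50, 12.95, 6.50, 4.99, 7.50, 14.95, 8.25, 3.95, 18.99, 10.00]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(age, price)
print(f"Y = {intercept:.3f} + {slope:.3f} X")  # prints: Y = 3.905 + 1.467 X
```

The result agrees with the equation reported from Excel, Y = 3.905 + 1.467X.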
Modelling the relationship: simple linear regression
We can use the estimated regression equation to make predictions
of wine price given a certain age.
For example, suppose we produce a bottle of wine that has been
ageing for 4½ years. How much should we sell it for?
We can take a reading from our graph or, more accurately, use our
regression equation:
Y = 3.905 + 1.467 × 4.5
≈ 10.51,
i.e. roughly £10.50.
Note that we should only use our regression equation to make
predictions using X-values that lie within the range of the data
observed (here, ages between 1 and 10 years).
So, for example, we should not use this regression equation to
estimate the selling price of a bottle of wine that has been ageing
for 12 years.
We can also interpret the regression equation in the following way:
for every one year increase in age, the selling price of a bottle of
wine increases by about £1.47.
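
The prediction rule and the warning about extrapolation can be combined in a small helper. This is only a sketch: the coefficients are the rounded values from the Excel fit (3.905 and 1.467), the age limits of 1 and 10 years are read off the wine data, and the function name predict_price is an illustrative choice.

```python
# Rounded coefficients from the fitted equation Y = 3.905 + 1.467 X.
INTERCEPT = 3.905
SLOPE = 1.467

# Smallest and largest ages in the observed wine data (years).
MIN_AGE, MAX_AGE = 1.0, 10.0


def predict_price(age_years):
    """Predict wine price (pounds) from age, refusing to extrapolate."""
    if not MIN_AGE <= age_years <= MAX_AGE:
        raise ValueError(
            f"age {age_years} is outside the observed range "
            f"{MIN_AGE} to {MAX_AGE} years; prediction would be extrapolation"
        )
    return INTERCEPT + SLOPE * age_years


print(f"Predicted price at 4.5 years: {predict_price(4.5):.2f}")
# Each extra year of age adds SLOPE (about 1.47 pounds) to the prediction.
print(f"Price difference between 5 and 4 years: {predict_price(5) - predict_price(4):.2f}")
```

Calling predict_price(12) raises a ValueError instead of returning a number, which mirrors the advice not to predict outside the observed range.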
Multiple Regression
Multiple Linear Regression: Wine Example
• Recall that we investigated the relationship between the price of wine
and its age.
• However, the price of a bottle of wine might also depend on other
factors, for example, the amount of rainfall during the growing
season, average temperature during the growing season, etc. Below is
the full dataset showing price of wine, with corresponding Age, total
rainfall and average temperature during growing season
Bottle         1     2      3     4     5     6      7     8     9      10
Price (Y)      4.50  12.95  6.50  4.99  7.50  14.95  8.25  3.95  18.99  10.00
Age (X1)       3.5   5      3     2.5   3     2      2.5   1     10     4
Rainfall (X2)  126   121    125   106   107   112    124   105   116    108
Temp (X3)      16    20     17    18    18    22     19    15    21     20
• As before, we write down a regression equation, now with one term for
each explanatory variable:
  Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε
The β’s are parameters that need to be estimated. But now we have four
β’s. Again, we use SPSS to calculate them for us (next slide).
• β0 can be thought of as the intercept term as before (labelled as
“constant” in SPSS output)
• β1 is the ‘age coefficient’
• β2 is the ‘rainfall coefficient’
• β3 is the ‘temperature coefficient’
• As before, ε is the ‘random error’ term, assumed to be zero on
average
Multiple Linear Regression
• General form: Y = β0 + β1 X1 + β2 X2 + β3 X3
• Fitted equation: Y = −22.54 + 0.81 X1 − 0.0004 X2 + 1.55 X3
• β1 = 0.81 is positive, indicating a positive relationship between age and price (i.e.
generally, older wines are more expensive). It is statistically significant (p-value < 0.05),
which suggests age is an important predictor of wine price. For every one-year increase in
age, the selling price of a bottle of wine increases by about £0.81, holding rainfall and
temperature constant.
• β2 = −0.0004 is negative, indicating a negative relationship between rainfall and price
(i.e. generally, wines from higher rainfall regions are cheaper). However, it is not
statistically significant (p-value > 0.05), so we cannot conclude that rainfall affects price.
• β3 = 1.55 is positive, indicating a positive relationship between temperature and price
(i.e. generally, wines from warmer regions are more expensive). It is statistically significant
(p-value < 0.05), which suggests temperature is an important predictor of wine price. For
every one-degree increase in temperature, the selling price of a bottle of wine increases by
about £1.55, holding age and rainfall constant.
What about the rest of the output?
The Excel output (below) also gives R-Square = 0.91 (or 91%).
R² measures the percentage of variability in the Y data that is
explained by the X variables.
If all our data lie on a straight line, X tells us everything
about Y, with no deviations from the line, and so R² = 100%.
The closer R² is to 100%, the better!
Here, we see that about 91% of the variation in wine price is
explained by the age of the wine, rainfall, and temperature. The rest
of the variation (9%) may be explained by other factors.
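
To make the multiple regression output less of a black box, the sketch below fits the three-variable model to the wine data and computes R² as one minus the ratio of unexplained to total variation. It is an illustration only, not the method used by the statistics package: it solves the least-squares normal equations with a small hand-written Gaussian elimination so that it needs nothing beyond the Python standard library. The function names solve and fit_ols are illustrative choices.

```python
# Wine data from the full dataset table.
price = [4.50, 12.95, 6.50, 4.99, 7.50, 14.95, 8.25, 3.95, 18.99, 10.00]
age = [3.5, 5, 3, 2.5, 3, 2, 2.5, 1, 10, 4]
rainfall = [126, 121, 125, 106, 107, 112, 124, 105, 116, 108]
temp = [16, 20, 17, 18, 18, 22, 19, 15, 21, 20]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(predictors, y):
    """Return least-squares coefficients [b0, b1, ...] and the design-matrix rows."""
    rows = [[1.0, *vals] for vals in zip(*predictors)]  # leading 1 gives the intercept
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty), rows


coeffs, rows = fit_ols([age, rainfall, temp], price)
fitted = [sum(b * v for b, v in zip(coeffs, row)) for row in rows]

mean_y = sum(price) / len(price)
ss_total = sum((yi - mean_y) ** 2 for yi in price)               # total variation in Y
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(price, fitted))  # unexplained variation
r_squared = 1 - ss_resid / ss_total

print("b0, b1, b2, b3 =", [round(b, 4) for b in coeffs])
print(f"R-squared = {r_squared:.3f}")
```

If the data have been transcribed correctly, the coefficients and R² printed here should be close to the values quoted above (an age coefficient near 0.81, a temperature coefficient near 1.55, and R² of about 0.91). The p-values are not computed in this sketch.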
Now apply your knowledge of regression analysis to the seminar case studies:
• Part 1: Deciding where to locate a business
• Part 2: Predicting the cost of running a business