Chapter Notes (Chapters 1&2)
Summarising data
In this chapter we will introduce a few basic concepts about data sets, and how they can be
presented and summarised.
Definitions.
Here are a few common words and their definitions.
Population – refers to the complete set of objects of interest. Examples:
o All new cars sold by Ford UK in Leeds this year (is this representative?)
CHAPTER 1. SUMMARISING DATA
o “yes/no”
Quantitative data – are given as numeric values. Quantitative data can be discrete or
continuous.
Quantitative discrete data are typically whole numbers. Examples:
o Number of absent students in a class
Quantitative continuous data can take any value in a range. The data may be rounded
to distinct values but we still think of the data as being for a continuous variable. Examples:
o The height of students in a class
Exercise 1.1
For each of the following variables, state whether they are QUALITATIVE or QUANTITATIVE.
If they are quantitative, are they DISCRETE or CONTINUOUS?
Before looking at the answers, come up with answers yourself first.
Answers: (i) Quantitative discrete (ii) Qualitative (iii) Quantitative continuous
(iv) Quantitative discrete (v) Qualitative (vi) Quantitative continuous (vii) Qualitative
(viii) Quantitative continuous (ix) Quantitative discrete (x) Qualitative
Examples 1.2
o Descriptive statistics: A lecturer wants to summarise the performance of 70 Leeds
foundation year students on MATH0365 over the past two years.
o Inferential statistics: The government wants to assess the opinion of the British public
on whether or not we should join / leave the European Union.
Notation.
With quantitative data we often use $x_i$ or $y_i$ to denote an observation, where $i$ takes the values
$1, 2, \ldots, n$, and $n$ is the number of observations in the sample.
Example 1.3
Suppose the variable of interest is the age of a student in years, and the population consists
of the students taking this module. If we take a sample of n = 5 observations, the data could
be
\[ x_1 = 20, \quad x_2 = 19, \quad x_3 = 21, \quad x_4 = 18, \quad x_5 = 24. \]
We often need to order quantitative data by size. We write $x_{(1)}$ for the smallest observation,
$x_{(2)}$ for the next smallest, and so on, so that $x_{(n)}$ denotes the largest observation. (Note the
round brackets around the index!) For the example above we have
\[ x_{(1)} = 18, \quad x_{(2)} = 19, \quad x_{(3)} = 20, \quad x_{(4)} = 21, \quad x_{(5)} = 24. \]
Often we need to add together observations for quantitative data. We use the notation
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n. \]
In the example above
\[ \sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 20 + 19 + 21 + 18 + 24 = 102. \]
Similarly, we can calculate
\[ \sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \ldots + x_n^2. \]
In the example,
\[ \sum_{i=1}^{5} x_i^2 = 20^2 + 19^2 + 21^2 + 18^2 + 24^2 = 400 + 361 + 441 + 324 + 576 = 2102. \]
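The notation above can be checked with a few lines of code. The following is a minimal Python sketch using the five ages from Example 1.3; the variable names are chosen here for illustration only.

```python
# Observations x_1, ..., x_5 from Example 1.3 (ages in years).
x = [20, 19, 21, 18, 24]

total = sum(x)                       # sum of the x_i
total_sq = sum(v ** 2 for v in x)    # sum of the x_i squared
ordered = sorted(x)                  # ordered[k - 1] plays the role of x_(k)

print("sum:", total)
print("sum of squares:", total_sq)
print("smallest x_(1):", ordered[0], "largest x_(n):", ordered[-1])
```

Note that Python lists are indexed from 0, so the order statistic x_(k) sits at position k - 1 of the sorted list.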
Exercise 1.4
Let $x_1 = 2$, $x_2 = 8$, $x_3 = 4$, $x_4 = 12$. Calculate each of the following.
(i) $\sum_{i=1}^{3} x_i =$
(ii) $\sum_{i=1}^{4} x_i^2 =$
(iii) $x_{(3)} =$
Stem-and-Leaf Diagrams.
Stem-and-leaf diagrams are used to display quantitative data in a more condensed form,
without losing the detail of the original data.
Example 1.5
The scores of 10 adults in a test are:
114, 99, 131, 124, 117, 102, 106, 127, 119, 114.
We use the last digit (the leaf) to represent each data item, with a stem consisting of the
previous digits. We always include a key on the diagram.
 9 | 9
10 | 2 6
11 | 4 4 7 9
12 | 4 7
13 | 1
Key: 11 | 4 represents a score of 114.
From the diagram we see that the lowest test score is $x_{(1)} = 99$ and the highest is $x_{(10)} = 131$;
40% of the scores are in the range 110-119.
Example 1.6
The closing prices (to the nearest £) of 20 common stocks on a certain date were
30, 34, 43, 9, 38, 9, 8, 29, 35, 19, 9, 17, 38, 54, 17, 1, 48, 18, 9, 9.
0 | 1 8 9 9 9 9 9
1 | 7 7 8 9
2 | 9
3 | 0 4 5 8 8
4 | 3 8
5 | 4
Key: 3 | 4 represents a price of £34.
We can see that the cheapest stock was £1 and the most expensive stock was £54. There
appear to be two different groupings of stock: those with prices in the range £1-£19, and
those with prices of around £30 or more.
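A stem-and-leaf diagram for whole-number data can also be produced mechanically. The sketch below is one possible way to do this in Python for the stock prices of Example 1.6; the function name `stem_and_leaf` is hypothetical, and the code assumes non-negative whole numbers with the last digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group whole-number observations by stem (all digits except the last)."""
    stems = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)   # the leaf is the last digit
        stems[stem].append(leaf)
    return dict(stems)

prices = [30, 34, 43, 9, 38, 9, 8, 29, 35, 19, 9, 17, 38, 54, 17, 1, 48, 18, 9, 9]
diagram = stem_and_leaf(prices)

# Print every stem between the smallest and the largest, even if it has no leaves.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Sorting the data first means the leaves on each stem come out in ascending order, as in the diagrams above.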
Histograms.
A stem-and-leaf diagram is useful if there are a small number of observations. For large
datasets histograms are more useful.
To construct a histogram we count the number of observations that lie within classes that we
choose. The classes span the range of the data and are non-overlapping. Bars are then drawn
which have areas proportional to the number of observations in the class.
IMPORTANT: It is the AREA of the bar that is proportional to the number of observations.
The number of observations in a class is called the class frequency.
Example 1.7
The data below refers to the amount spent on food (in £) in one week by 40 UK households.
44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81.
We are free to choose the widths of the classes, though remember they must be non-overlapping
and span the range of the data. In the above data the observations range from 18.60 to 95.90.
One possible approach, among many others, is to choose classes 17.495-27.495, 27.495-37.495,
. . . , 87.495-97.495.
Note: Here the boundaries of the classes were chosen to end in the three digits .495 so that
no data point falls on the boundary between two classes. This is not
strictly necessary, as long as we are consistent and make clear to the reader how we group
the data points into the classes.
Figure 1.1: Histogram showing the amount spent on food (in £) in one week by 40 households.
Note that the width of each class is 10. The first bar over the interval from 17.495 to 27.495
for example, should have an area of 3, since this is the number of observations, i.e. frequency,
in this class. So the height of this bar is 0.3, since 0.3 × 10 = 3. Similarly, for the other
intervals.
So the height of each bar is the class frequency divided by the class width. This quantity is
referred to as the frequency density of a class, and it is the quantity recorded on the vertical
axis:
\[ \text{Frequency density} = \frac{\text{Class frequency}}{\text{Class width}}. \]
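To illustrate the calculation, here is a short Python sketch that counts the class frequencies for the food data of Example 1.7 and divides by the class width. It assumes the classes 17.495-27.495, ..., 87.495-97.495 chosen above; the variable names are illustrative.

```python
food = [
    44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
    79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
    78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
    65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81,
]
lower, width, n_classes = 17.495, 10, 8

# Count how many observations fall into each class.
frequencies = [0] * n_classes
for amount in food:
    k = int((amount - lower) // width)   # index of the class containing amount
    frequencies[k] += 1

# Frequency density = class frequency / class width (the height of each bar).
densities = [f / width for f in frequencies]

for k, (f, d) in enumerate(zip(frequencies, densities)):
    start = lower + k * width
    print(f"{start:.3f}-{start + width:.3f}: frequency {f}, density {d:.1f}")
```

Because all classes here have the same width, the densities are simply the frequencies divided by 10; with unequal class widths each class would need its own width in the division.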
Figure 1.2: Histogram showing amount spent on food (in £) in one week by 40 households.
Note how the area of the rectangle over the interval from 37.495 to 57.495 in Figure 1.2 is
the sum of the areas of the two rectangles over the intervals from 37.495 to 47.495 and 47.495
to 57.495 in Figure 1.1.
Exercise 1.8
Using some graph or other grid paper (if you have access to some), construct a histogram to
display the following data which refers to the amount of time (to the nearest minute) students
spent studying for a test. Before drawing the histogram, determine the frequency densities,
and decide on the scale for the histogram.
The Mode.
The mode is defined to be the most frequently occurring value in a set of data.
Example 1.9
The data below gives the number of students present at a statistics tutorial over 10 consecutive
weeks.
8, 5, 7, 10, 8, 6, 5, 6, 4, 8.
Arranging in ascending numerical order we have
4, 5, 5, 6, 6, 7, 8, 8, 8, 10.
The mode of this set of data is 8. We also say that “the modal number of students present is
8”.
The mode need not be unique. If two values occur most frequently, the data are said to be
bimodal. If more than two values occur most frequently, the data are multimodal.
Note that the mode can be determined for qualitative and quantitative data, and is guaranteed
to be a value actually observed.
The Median.
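The median is defined as the “middle value” when the data are arranged in ascending numerical
order. More precisely, for a data set with $n$ observations,
if $n$ is odd, then the median is $x_{\left(\frac{n+1}{2}\right)}$;
if $n$ is even, then the median is $\frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right]$.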
So, for an odd number of observations the median is the value “in the middle”; for an even
number of observations the median is the average of the two “middle values”.
Example 1.10
The data set
0, 7, 7, 19, 19, 20, 35,
has 7 data points. So the median is the 4th in this ordered list, i.e. the median is 19.
Example 1.11
Given the data from Example 1.9 with 10 data points,
4, 5, 5, 6, 6, 7, 8, 8, 8, 10,
the median is the average of the 5th and 6th observations, which is
\[ \frac{6 + 7}{2} = \frac{13}{2} = 6.5, \]
i.e. the median number of students attending the tutorial is 6.5.
Note that the median can be a number which is not an actual observation.
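The two cases of the median definition translate directly into code. The following Python sketch (with a hypothetical function name `median`) follows the odd/even rule given above and reproduces Examples 1.10 and 1.11.

```python
def median(data):
    """Median as defined above: middle value, or average of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the median is x_((n+1)/2); subtract 1 for 0-based indexing.
        return ordered[(n + 1) // 2 - 1]
    # n even: average of x_(n/2) and x_(n/2 + 1).
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([0, 7, 7, 19, 19, 20, 35]))          # Example 1.10
print(median([8, 5, 7, 10, 8, 6, 5, 6, 4, 8]))    # Example 1.11
```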
The Mean.
The mean, or sample mean, or average, is calculated by adding up all of the observations
and dividing by the number of observations. For observations $x_i$, where $i$ takes the
values $1, 2, 3, \ldots, n$, we write
\[ \text{sample mean} = \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
We denote the sample mean by $\bar{x}$ (x with a horizontal bar on top), pronounced “x bar”.
Example 1.12
The mean for the data from Example 1.9 is
\[ \bar{x} = \frac{1}{10} \sum_{i=1}^{10} x_i = \frac{8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8}{10} = \frac{67}{10} = 6.7. \]
The mean number of students attending the tutorial is 6.7 (again, not a number that we could
actually observe).
The mean is a very widely used measure, as it has many desirable statistical properties.
Example 1.13
Subscribers to Hello magazine are asked the following question in a survey: “How many of
the last four issues have you read or looked through?”
Suppose that the following frequency distribution summarises 500 responses.
Let’s now look at an example where the data are given as grouped data without the original
individual data values.
Example 1.14
Cars traveling on a road with a posted speed limit of 60 miles per hour are checked for speed
by a police radar system, giving the following frequency distribution of speeds.
Speed (miles per hour)   45-49   50-54   55-59   60-64   65-69   70-74   75-79   Total
Midpoint $x_i$              47      52      57      62      67      72      77
Frequency $f_i$             10      40     150     175      75      15      10   n = 475
$f_i x_i$                  470    2080    8550   10850    5025    1080     770   28825
To estimate the mean (it can only be an estimate, since we do not know the original data
values!) we use the midpoints of the classes as the $x_i$ in the formula for the mean of frequency data:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{m} f_i x_i = \frac{28825}{475} \approx 60.68. \]
We see that the mean speed of a car checked by the police was about 60.68 miles per hour, just in
excess of the speed limit!
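The estimate of the mean from grouped data can be reproduced as follows. This is a minimal Python sketch using the midpoints and frequencies from the table in Example 1.14; the variable names are illustrative.

```python
midpoints = [47, 52, 57, 62, 67, 72, 77]        # class midpoints x_i
frequencies = [10, 40, 150, 175, 75, 15, 10]    # class frequencies f_i

n = sum(frequencies)                                           # total number of cars
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))    # sum of f_i * x_i
mean_estimate = sum_fx / n

print("n =", n)
print("sum of f_i x_i =", sum_fx)
print("estimated mean speed =", round(mean_estimate, 2))
```

The result is only an estimate, because every car in a class is treated as if it travelled at the class midpoint.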
Remark 1.15
The mean may be significantly affected by the inclusion of a mistaken observation, as the
following example shows:
Example 1.16
Suppose in Example 1.9 the observation 10 is recorded as 100 by mistake. The data are now
4, 5, 5, 6, 6, 7, 8, 8, 8, 100.
The mode does not change, it is still 8. The median is unchanged, it is still $\frac{6+7}{2} = 6.5$.
However, the mean is now:
\[ \bar{x} = \frac{1}{10} \sum_{i=1}^{10} x_i = \frac{4 + 5 + 5 + 6 + 6 + 7 + 8 + 8 + 8 + 100}{10} = \frac{157}{10} = 15.7. \]
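The effect of the mistaken observation can be seen side by side with Python's standard `statistics` module. This is a small sketch, assuming Python 3.8 or later (for `statistics.multimode`); the list names are illustrative.

```python
import statistics

correct = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]               # data from Example 1.9
mistaken = [100 if v == 10 else v for v in correct]     # 10 recorded as 100

for label, data in [("correct", correct), ("mistaken", mistaken)]:
    print(
        label,
        "mode(s):", statistics.multimode(data),   # list of most frequent values
        "median:", statistics.median(data),
        "mean:", statistics.mean(data),
    )
```

Only the mean moves when the single value changes; the mode and the median stay where they were.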
53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40
Calculate the mean, the median and the mode of these data.
Quartiles.
The median is the value that splits the ordered data into two halves. Recall that we defined
it as follows:
o If $n$ is odd, then the median is $x_{\left(\frac{n+1}{2}\right)}$, and is an actual data point. In this case we
include the median in both the lower half and the upper half.
o If $n$ is even, then the median is $\frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right]$. In this case the lower half consists simply of
the lower $\frac{n}{2}$ values, and the upper half of the upper $\frac{n}{2}$ values.
The quartiles split the ordered data into quarters. We refer to the median as Q2 , and define
the lower quartile Q1 to be the median of the lower half, and the upper quartile Q3 to
be the median of the upper half.
!!! Be careful when reading textbooks, or using calculators or other software to compute quartiles.
There are a number of different ways to define the lower and upper quartile, producing
different results !!!
Example 1.18
In a survey of 21 households the number of telephones used by each household is given
below.
1, 3, 4, 1, 1, 2, 1, 1, 2, 5, 1, 2, 3, 0, 2, 1, 2, 1, 3, 0, 4.
Arranging the data in ascending numerical order gives
0, 0, 1, 1, 1, $x_{(6)} = 1$, 1, 1, 1, 1, $x_{(11)} = 2$, 2, 2, 2, 2, $x_{(16)} = 3$, 3, 3, 4, 4, 5.
There are $n = 21$ data items. Twenty-one is an odd number, and so the median number of
telephones per household is the data item $Q_2 = x_{\left(\frac{n+1}{2}\right)} = x_{(11)} = 2$. The lower quartile $Q_1$ is the
median of the data items $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(11)}$, which is $x_{(6)} = 1$. The upper quartile $Q_3$ is
the median of the data items $x_{(11)}, x_{(12)}, x_{(13)}, \ldots, x_{(21)}$, which is $x_{(16)} = 3$.
Example 1.19
The data below gives the monthly starting salaries (in $) of 12 business school graduates.
2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880.
Since 12 is even, the median is $Q_2 = \frac{x_{(6)} + x_{(7)}}{2} = 2905$.
The lower quartile $Q_1$ is the median of $x_{(1)}, x_{(2)}, \ldots, x_{(6)}$, which is $Q_1 = \frac{x_{(3)} + x_{(4)}}{2} = 2865$.
The upper quartile $Q_3$ is the median of $x_{(7)}, x_{(8)}, \ldots, x_{(12)}$, which is $Q_3 = \frac{x_{(9)} + x_{(10)}}{2} = 3000$.
See whether you can get the same answers.
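Since different textbooks and software packages define quartiles differently, it can help to code the definition used in these notes explicitly. The Python sketch below follows the rule stated above (for odd n the median is included in both halves); the function names are hypothetical, and library routines may give different values.

```python
def median(ordered):
    """Median of a list that is already sorted in ascending order."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3) using the 'median of each half' definition."""
    ordered = sorted(data)
    n = len(ordered)
    half = (n + 1) // 2            # for odd n the median belongs to both halves
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(ordered), median(upper)

phones = [1, 3, 4, 1, 1, 2, 1, 1, 2, 5, 1, 2, 3, 0, 2, 1, 2, 1, 3, 0, 4]
salaries = [2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880]

print(quartiles(phones))      # Example 1.18
print(quartiles(salaries))    # Example 1.19
```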
We now introduce a number of concepts which measure the dispersion, i.e. the spread
or variability, of a quantitative data set.
The Range.
The range is simply the difference between the largest and the smallest observation, i.e. in
a data set with $n$ observations
\[ \text{range} = x_{(n)} - x_{(1)}. \]
The interquartile range (IQR) is the difference between the upper and the lower quartile,
\[ \text{IQR} = Q_3 - Q_1. \]
For Example 1.19, the interquartile range IQR is $3000 − $2865 = $135.
Example 1.20
The data given in Example 1.9 with the number of students present at a statistics tutorial
over 10 consecutive weeks was
8, 5, 7, 10, 8, 6, 5, 6, 4, 8,
For larger data sets, this involves a lot of calculations. Here is a more “calculator friendly”
form of the sample variance:
\[ s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]. \]
One can show that the two formulas produce the same results.
So in the example from above, we could compute the sample variance instead as follows:
\[ \sum_{i=1}^{n} x_i^2 = 8^2 + 5^2 + 7^2 + 10^2 + 8^2 + 6^2 + 5^2 + 6^2 + 4^2 + 8^2 = 479, \quad \text{and} \]
\[ \sum_{i=1}^{n} x_i = 8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8 = 67, \quad \text{so} \]
\[ s^2 = \frac{1}{9} \left( 479 - \frac{67^2}{10} \right) = \frac{1}{9} (479 - 448.9) \approx 3.3. \]
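The “calculator friendly” formula is easy to check numerically. The sketch below computes it in Python for the tutorial data and compares it with `statistics.variance` from the standard library, which also uses the divisor n − 1; the variable names are illustrative.

```python
import statistics

x = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]    # tutorial attendance data
n = len(x)

sum_x = sum(x)                          # sum of the x_i
sum_x2 = sum(v ** 2 for v in x)         # sum of the x_i squared

# Calculator-friendly form: (sum x_i^2 - (sum x_i)^2 / n) / (n - 1)
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print("sum of squares:", sum_x2, "sum:", sum_x)
print("sample variance:", round(s2, 2))
print("library value:  ", round(statistics.variance(x), 2))
```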
Example 1.21
Let us return to the survey on Hello subscribers from earlier in the chapter.
Here is how we calculate the sample variance s2 for these data, using the frequencies fi .
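We have $n = \sum_{i=1}^{m} f_i = 500$. So
\[ s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{m} f_i x_i \right)^2 \right] = \frac{1}{499} \left( 6535 - \frac{1745^2}{500} \right) \approx 0.89. \]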
Example 1.22
We now look at the police speed check example; see whether you can follow how the
formula below was used.
Speed (miles per hour)    45-49    50-54    55-59    60-64    65-69    70-74    75-79    Total
Midpoint $x_i$               47       52       57       62       67       72       77
Frequency $f_i$              10       40      150      175       75       15       10    n = 475
$f_i x_i$                   470     2080     8550    10850     5025     1080      770    28825
$f_i x_i^2$               22090   108160   487350   672700   336675    77760    59290    1764025
Here
\[ s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{m} f_i x_i \right)^2 \right] = \frac{1}{474} \left( 1764025 - \frac{28825^2}{475} \right) \approx 31.23 \text{ mph}^2. \]
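The same calculation can be followed step by step in code. This Python sketch rebuilds the two sums from the table and applies the frequency form of the variance formula; taking the square root at the end to obtain a standard deviation in mph is an addition made here for illustration.

```python
import math

midpoints = [47, 52, 57, 62, 67, 72, 77]        # class midpoints x_i
frequencies = [10, 40, 150, 175, 75, 15, 10]    # class frequencies f_i

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))         # sum f_i x_i
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))   # sum f_i x_i^2

s2 = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)      # sample variance in mph^2
s = math.sqrt(s2)                               # standard deviation in mph

print("sum f_i x_i^2 =", sum_fx2)
print("sample variance =", round(s2, 2), "mph^2")
print("standard deviation =", round(s, 2), "mph")
```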
Exercise 1.23
(We will go through this one in the lecture.)
The data below refer to the number of brothers and sisters a sample of students have. You
will need to calculate the values that go in place of the constants B1 , B2 , B3 and B4 .
Also calculate the mean, the median, the sample variance, and the standard deviation for
these data.
\[ \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}. \]
Note that this is the same as $\frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$.
This coefficient takes values between −1 and +1. A value near zero indicates symmetry, a
negative value indicates that the data are negatively skewed, and a positive value indicates that the
data are positively skewed.
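As a quick illustration, the coefficient above can be evaluated for the quartiles found in Example 1.19 (Q1 = 2865, Q2 = 2905, Q3 = 3000). The function name `quartile_skewness` in this Python sketch is hypothetical.

```python
def quartile_skewness(q1, q2, q3):
    """Skewness coefficient based on quartiles: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

# Quartiles of the starting salaries in Example 1.19.
skew = quartile_skewness(2865, 2905, 3000)
print(round(skew, 3))
```

The value is positive because the median lies closer to the lower quartile than to the upper quartile.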
Box Plots.
A useful way of comparing two data sets is to produce box plots. To construct a box plot you
must first calculate the median $Q_2$ and the quartiles $Q_1$ and $Q_3$. The general form of a box
plot is shown below:
[Figure: general form of a box plot, with the smallest observation, $Q_1$, the median, $Q_3$ and the largest observation marked.]
Example 1.24
The following data give the systolic blood pressure of 12 smokers and 12 non-smokers.
Smokers       122  146  120  114  124  126  118  128  130  134  116  130
Non-smokers   114  134  114  116  138  110  112  116  132  126  108  116
For the smokers the median Q2 = 125, the lower quartile Q1 = 119, and the upper quartile
Q3 = 130.
For the non-smokers, the median Q2 = 116, the lower quartile Q1 = 113, and the upper
quartile Q3 = 129.
Figure 1.4: Box plots comparing blood pressure in smokers and non-smokers.
We see that the blood pressures of the non-smokers tend to be lower than those of the
smokers (the box in the non-smokers’ plot is shifted to the left compared to the smokers’ plot).
However, blood pressure amongst non-smokers appears to be more variable than amongst
smokers (the box in the non-smokers’ plot is wider than the box in the smokers’ plot).
An outlier is an extremely high or extremely low observation. An outlier may be a data item
that has been incorrectly recorded (in which case it should be removed from the data set), or
it may be a genuine observation (but unusual in some way). An observation is identified as
an outlier if it is less than
\[ Q_1 - \frac{3}{2} (Q_3 - Q_1), \]
or greater than
\[ Q_3 + \frac{3}{2} (Q_3 - Q_1). \]
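The outlier rule can be applied mechanically once the quartiles are known. The Python sketch below uses the starting salaries and quartiles from Example 1.19 as an illustration; the function name `outlier_limits` is hypothetical.

```python
def outlier_limits(q1, q3):
    """Lower and upper limits beyond which an observation counts as an outlier."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

salaries = [2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880]
low, high = outlier_limits(2865, 3000)    # Q1 and Q3 from Example 1.19

outliers = sorted(v for v in salaries if v < low or v > high)
print("limits:", low, high)
print("outliers:", outliers)
```

Whether a flagged value is a recording mistake or a genuine but unusual observation still has to be judged from the context.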
Exercise 1.25
(We will go through this one in the lecture.)
Here are the case prices (in $) for 13 wines produced in the USA.
52, 66, 70, 80, 95, 100, 110, 112, 115, 118, 123, 143, 151.
(iii) Construct a box plot to display the data, using graph or grid paper (if available).
CHAPTER 2
When looking at data we often are interested in whether there is a relationship or dependence
between two or more variables. For example, you could be looking for evidence that the height
of the parents and the height of the children are related, or that there is some connection
between the age of people and a certain illness.
We will focus on the relationship between two variables, which we will typically call x and y.
If one of the two variables can be identified as potentially affecting the other, we would name
that variable x. Such an identification is not always possible, though.
Scatter Diagrams.
When dealing with two (or more) variables it is always useful to produce a scatter plot. The
variable x is plotted on the horizontal axis and the variable y is plotted on the vertical axis.
Example 2.1
The variable x is the number of TV commercials for a product shown during a week, and the
variable y is the volume of sales (in £100s) during the following week:
Number of commercials, $x_i$    2   5   1   3   4   1   5   3   4   2
Sales (£100s), $y_i$           50  57  41  54  54  38  63  48  59  46
[Figure: scatter diagram of sales against number of commercials shown.]
We can see from the scatter plot that as the number of commercials increases, the sales increase
as well. We say that the number of commercials and the sales seem to be “correlated”. The
correlation coefficient, which we define in the next section, is a numerical measure of how
strong this “correlation” is.
Correlation.
The correlation coefficient r is a number between −1 and +1 that quantifies how closely two
variables x and y follow a linear relation, and whether this relationship is positive or negative.
It is defined as
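\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \]
where
\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2. \]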
o If r > 0 then there is a positive linear relationship between the xi and the yi , meaning
that as the xi get larger, the yi tend to get larger as well at a (more or less) constant
rate.
CHAPTER 2. CORRELATION AND REGRESSION
o If r < 0 then there is a negative linear relationship between the xi and the yi , meaning
that as the xi get larger, the yi tend to decrease at a (more or less) constant rate.
The closer the correlation coefficient is to +1 or −1, respectively, the stronger/more
perfect this linear relationship is.
Figure 2.2: The points lie on a straight line with a positive gradient.
Figure 2.3: The points lie on a straight line with a negative gradient.
However, these cases where the correlation coefficient r is equal to 1, −1, or 0 are exceptional;
for real data, the correlation coefficient is almost never exactly equal to 1, −1, or 0 and
typically lies somewhere in between.
To understand a bit better why the formula for the correlation coefficient has the form it has,
consider the following example:
Let x1 , . . . , xn be the individual marks of students in a mock exam, and let y1 , . . . , yn be
the individual marks of those students in the final exam. We expect that there is a strong
positive relationship/correlation between the variable x and y, i.e. that a high mark in the
mock exam will most likely result in a high mark in the final exam, and similarly for low
marks. So we anticipate that a positive value for xi − x̄ (high mark in the mock exam) goes
together with a positive value for yi − ȳ (high mark in the final exam), and that a negative
value for xi − x̄ (low mark in the mock exam) goes together with a negative value for yi − ȳ
(low mark in the final exam). In both cases this results in a positive value for (xi − x̄)(yi − ȳ)
in the numerator. The denominator simply scales the quantity to values between -1 and 1.
Hence, correlation coefficients near −1 or +1 are indicative of a strong linear relationship
between the variables x and y, and correlation coefficients near zero indicate that there is no, or
only a very weak, linear relationship between the variables. Values outside the range of −1 to 1
indicate that a mistake in the calculation has occurred.
It is important to note that the correlation coefficient does not tell us anything about
the slope or gradient of the linear relationship, if there is one.
Fact 2.2
Notice the similarity of the formulas for Sxx and Syy to the formulas for the sample variance.
As with the formula for the variance, there are more calculator-friendly versions of the formulas
for Sxy , Sxx , and Syy . These are:
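\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right), \]
\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2, \]
\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2. \]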
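These calculator-friendly formulas translate directly into code. The following Python sketch applies them to the TV commercials data of Example 2.1; the function name `correlation` is hypothetical, and the value of r it prints is computed here rather than quoted from the notes.

```python
import math

def correlation(x, y):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy), calculator-friendly form."""
    n = len(x)
    sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(u * u for u in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)

commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]        # x_i from Example 2.1
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]    # y_i in units of 100 pounds

r = correlation(commercials, sales)
print(round(r, 2))
```

A value of r close to +1 agrees with the impression from the scatter diagram that sales increase with the number of commercials.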
Example 2.4
Consider the following data, which refer to the amount of fat (in grams) and the number
of calories in a sample of fast food beef burgers. Draw a scatter diagram of these data, and
compute the correlation coefficient using the formulas above.
Fat (grams), $x_i$    19   31   34   35   39   39   43
Calories, $y_i$      410  580  590  570  640  680  660
We will look at this example in the lecture, but you could try it yourself beforehand.
A Misuse of Correlation.
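\[ \sum_{i=1}^{9} x_i = 0, \quad \sum_{i=1}^{9} x_i^2 = 3.75, \quad n = 9, \quad \sum_{i=1}^{9} y_i = 6, \quad \sum_{i=1}^{9} y_i^2 = 5.27, \quad \sum_{i=1}^{9} x_i y_i = 0. \]
\[ S_{xx} = \sum_{i=1}^{9} x_i^2 - \frac{1}{9} \left( \sum_{i=1}^{9} x_i \right)^2 = 3.75 - \frac{0^2}{9} = 3.75, \]
\[ S_{yy} = \sum_{i=1}^{9} y_i^2 - \frac{1}{9} \left( \sum_{i=1}^{9} y_i \right)^2 = 5.27 - \frac{6^2}{9} = 1.27, \]
\[ S_{xy} = \sum_{i=1}^{9} x_i y_i - \frac{1}{9} \left( \sum_{i=1}^{9} x_i \right) \left( \sum_{i=1}^{9} y_i \right) = 0 - \frac{0 \times 6}{9} = 0, \]
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{0}{\sqrt{3.75 \times 1.27}} = 0. \]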
Here r = 0, indicating that a linear relationship is not appropriate. Without plotting the
data we might have mistakenly concluded that there was no relationship between the two
variables.
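$x_i$   1   2   3   4   5
$y_i$   5   4   3   2   100
\[ \sum_{i=1}^{5} x_i = 15, \quad \sum_{i=1}^{5} x_i^2 = 55, \quad n = 5, \quad \sum_{i=1}^{5} y_i = 114, \quad \sum_{i=1}^{5} y_i^2 = 10054, \quad \sum_{i=1}^{5} x_i y_i = 530. \]
\[ S_{xx} = \sum_{i=1}^{5} x_i^2 - \frac{1}{5} \left( \sum_{i=1}^{5} x_i \right)^2 = 55 - \frac{15^2}{5} = 10, \]
\[ S_{yy} = \sum_{i=1}^{5} y_i^2 - \frac{1}{5} \left( \sum_{i=1}^{5} y_i \right)^2 = 10054 - \frac{114^2}{5} = 7454.8, \]
\[ S_{xy} = \sum_{i=1}^{5} x_i y_i - \frac{1}{5} \left( \sum_{i=1}^{5} x_i \right) \left( \sum_{i=1}^{5} y_i \right) = 530 - \frac{15 \times 114}{5} = 188, \]
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{188}{\sqrt{10 \times 7454.8}} \approx 0.69. \]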
A correlation coefficient of 0.69 would indicate that there is a positive linear relationship
between the variables x and y. However, if the value of 100 had been correctly recorded
as 1 then we would see that there is a perfect negative linear relationship between the
variables, and the correlation coefficient would take the value −1. It is easy to make a mistake
in recording data, or entering it into a computer. Failure to spot such a mistake can lead to
incorrect conclusions!
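The effect of the single mis-recorded value can be verified numerically. The Python sketch below computes r from the defining sums of deviations, once with the recorded value 100 and once with the corrected value 1; the function name `correlation_from_deviations` is hypothetical.

```python
import math

def correlation_from_deviations(x, y):
    """r = Sxy / sqrt(Sxx * Syy), using deviations from the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y))
    sxx = sum((u - x_bar) ** 2 for u in x)
    syy = sum((v - y_bar) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_recorded = [5, 4, 3, 2, 100]    # last value entered incorrectly
y_corrected = [5, 4, 3, 2, 1]     # last value as it should have been

r_recorded = correlation_from_deviations(x, y_recorded)
r_corrected = correlation_from_deviations(x, y_corrected)
print(round(r_recorded, 2), round(r_corrected, 2))
```

One wrong entry is enough to turn a perfect negative linear relationship into an apparently positive one, which is why plotting the data first is so useful.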
Figure 2.7: Scatter diagram of rum prices in Havana against salaries of ministers in Massachusetts.
If you are interested in seeing other examples of spurious correlations, have a look at this web
site: https://www.tylervigen.com/spurious-correlations.
Linear Regression
Linear regression is an important and widely used technique to model relationships between
variables. It is used for various purposes, for example to explain and quantify a relationship,
but also to make predictions.
Suppose the variable x can affect the variable y and we want to model the relationship in
terms of fitting the most appropriate straight line to the data. We use the method of least
squares regression.
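We fit the straight line relationship $y = a + bx$ to the data. In general it is not possible to
find a line that fits the data exactly. At each point $x_i$ there will be an error (residual) in
our estimated value of $y_i$, which we denote by $r_i = y_i - (a + b x_i)$. One sensible approach is
to choose the straight line with equation $y = a + bx$ that minimises the sum of the $r_i^2$ (i.e.
makes $\sum_{i=1}^{n} r_i^2$ as small as possible).
[Figure: data points with the line $y = a + bx$ and the residuals $r_1, \ldots, r_5$.]
The values of $a$ and $b$ that achieve this are
\[ b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}, \]
where $S_{xy}$ and $S_{xx}$ are defined as in the section on correlation.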
Linear regression has many practical uses. One is to make predictions. The regression line
provides a model which allows us to plug in values for x for which we do not have accompanying
values for y, and so we can make predictions for the values of y.
Example 2.5
Suppose that we are interested in the relationship between the number of staff at an insurance
firm and the number of policies the firm issues (per month). It seems sensible to suggest that
firms with more staff will issue a larger number of insurance policies. How can we verify and
quantify this relationship?
The following data refer to 25 independent insurance firms.
Number of staff, $x_i$          3     6     5     8    16    15    23     5    13
Number of policies, $y_i$      44   144   150   236   739   970  2371   309   679
Number of staff, $x_i$          4    19    33    19    10    16    22     2     3
Number of policies, $y_i$      26  1272  3246  1904   357  1080  1027    45    62
Number of staff, $x_i$          2    22     2    18    21    24     9
Number of policies, $y_i$      68  2507   138   502  1501  2750   192
The first step when looking for a relationship between two variables is to plot them against
each other.
Figure 2.9: Scatter diagram of number of policies against number of staff for 25 insurance
firms.
It seems that there is indeed a relationship between the two variables; firms with a larger
number of staff do issue more insurance policies! The plot of the data confirms that it
is appropriate to fit a straight line to the data. We begin by calculating the
correlation coefficient, r.
$x_i$   $y_i$   $x_i^2$   $y_i^2$   $x_i y_i$
3 44 9 1936 132
6 144 36 20736 864
5 150 25 22500 750
8 236 64 55696 1888
16 739 256 546121 11824
15 970 225 940900 14550
23 2371 529 5621641 54533
5 309 25 95481 1545
13 679 169 461041 8827
4 26 16 676 104
19 1272 361 1617984 24168
33 3246 1089 10536516 107118
19 1904 361 3625216 36176
10 357 100 127449 3570
16 1080 256 1166400 17280
22 1027 484 1054729 22594
2 45 4 2025 90
3 62 9 3844 186
2 68 4 4624 136
22 2507 484 6285049 55154
2 138 4 19044 276
18 502 324 252004 9036
21 1501 441 2253001 31521
24 2750 576 7562500 66000
9 192 81 36864 1728
320 22319 5932 42313977 470050
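\[ \sum_{i=1}^{25} x_i = 320, \quad \sum_{i=1}^{25} x_i^2 = 5932, \quad n = 25, \quad \bar{x} = 12.8, \]
\[ \sum_{i=1}^{25} y_i = 22319, \quad \sum_{i=1}^{25} y_i^2 = 42313977, \quad \sum_{i=1}^{25} x_i y_i = 470050, \quad \bar{y} = 892.76. \]
\[ S_{xx} = \sum_{i=1}^{25} x_i^2 - \frac{1}{25} \left( \sum_{i=1}^{25} x_i \right)^2 = 5932 - \frac{320^2}{25} = 1836, \]
\[ S_{yy} = \sum_{i=1}^{25} y_i^2 - \frac{1}{25} \left( \sum_{i=1}^{25} y_i \right)^2 = 42313977 - \frac{22319^2}{25} \approx 22388467, \]
\[ S_{xy} = \sum_{i=1}^{25} x_i y_i - \frac{1}{25} \left( \sum_{i=1}^{25} x_i \right) \left( \sum_{i=1}^{25} y_i \right) = 470050 - \frac{320 \times 22319}{25} = 184366.8, \]
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{184366.8}{\sqrt{1836 \times 22388467}} \approx 0.91. \]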
We want to use the number x of staff a firm has to predict the number y of policies they
issue. We will now fit the regression line $y = a + bx$. Here
\[ b = \frac{S_{xy}}{S_{xx}} = \frac{184366.8}{1836} \approx 100.42, \]
\[ a = \bar{y} - b\bar{x} = 892.76 - 100.42 \times 12.8 \approx -392.59 \]
(using the unrounded value of $b$ in the calculation).
We have the regression line $y = -392.59 + 100.42x$.
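The fit can be reproduced from the raw data. The Python sketch below computes b = Sxy/Sxx and a = ȳ − b x̄ for the 25 insurance firms; the function name `least_squares` is hypothetical, and the code keeps b unrounded when computing a.

```python
staff = [3, 6, 5, 8, 16, 15, 23, 5, 13,
         4, 19, 33, 19, 10, 16, 22, 2, 3,
         2, 22, 2, 18, 21, 24, 9]
policies = [44, 144, 150, 236, 739, 970, 2371, 309, 679,
            26, 1272, 3246, 1904, 357, 1080, 1027, 45, 62,
            68, 2507, 138, 502, 1501, 2750, 192]

def least_squares(x, y):
    """Return (a, b) for the least squares regression line y = a + b x."""
    n = len(x)
    sxx = sum(u * u for u in x) - sum(x) ** 2 / n
    sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
    b = sxy / sxx                        # gradient
    a = sum(y) / n - b * sum(x) / n      # intercept: y-bar minus b times x-bar
    return a, b

a, b = least_squares(staff, policies)
print("a =", round(a, 2), " b =", round(b, 2))
```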
We interpret the parameter b (the gradient of the regression line) to mean that, for every
extra member of staff, a firm issues 100.42 extra insurance policies per month (or, phrased in
a more sensible manner, each extra member of staff issues approximately 100 extra policies).
Remember how the gradient represents the change in y as we increase x by 1.
The value of the parameter a here does not have a sensible physical interpretation. What
it appears to suggest is that a firm with no staff (!) issues −393 insurance policies, clearly
nonsense!
This illustrates an important point. Do not attempt to extend results too far outside the
range of the data. The smallest firm we have recorded has two staff, so we should not try to
draw conclusions for firms much smaller than this.
The next step is to plot our regression line on the same axes as the original data, to check
that we have not made a mistake, and that the relationship seems appropriate.
The easiest way to add the regression line to the plot is to calculate the value of y for several
points on the x-axis, joining the resulting points by a straight line.
We add these points to our original plot and join them by a line.
Figure 2.10: Scatter diagram of number of policies against number of staff for 25 insurance
firms.
How can we make use of this regression line? Suppose we are interested in a firm that has 27
staff, and want to estimate the number of policies it will issue in a month. If x = 27, then we
predict
y = −392.59 + 100.42 × 27 = 2318.75.
We round the answer to the nearest whole number and report that a firm with 27 staff will
issue approximately 2319 insurance policies per month.
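A prediction like this is a one-line calculation, but it is worth guarding against using the line far outside the observed data. The Python sketch below uses the rounded coefficients from above and, as a deliberately strict assumption made here, refuses any x outside the observed range of 2 to 33 staff; the function name `predict_policies` is hypothetical.

```python
A, B = -392.59, 100.42     # rounded regression coefficients from Example 2.5
X_MIN, X_MAX = 2, 33       # smallest and largest number of staff in the data

def predict_policies(staff):
    """Predicted number of policies per month for a firm with the given staff."""
    if not X_MIN <= staff <= X_MAX:
        # Strict choice for this sketch: no extrapolation at all.
        raise ValueError("number of staff is outside the range of the data")
    return A + B * staff

prediction = predict_policies(27)
print(round(prediction, 2), "->", round(prediction))
```

In practice a small amount of extrapolation may be acceptable; the hard cut-off here is only meant to make the warning about the range of the data concrete.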
Example 2.6
Using the data and the diagram you drew in Example 2.4, compute the parameters a and b
in the regression line y = a + bx and add your regression line to the scatter plot.
We will look at this example in the lecture, or you can try it yourself.