Chapter Notes (Chapters 1&2)

Uploaded by

DANIYA GENERAL
Copyright
© © All Rights Reserved

CHAPTER 1

Summarising data

In this chapter we will introduce a few basic concepts about data sets, and how they can be
presented and summarised.

Definitions.
Here are a few common words and their definitions.
Population – refers to the complete set of objects of interest. Examples:

o All students at Leeds University

o All new cars sold by Ford UK this year

o All oranges in a crate

Sample – refers to a subset of the population, usually chosen to be representative of the
population with respect to some characteristic. Examples:

o All students registered on this module (is this representative?)

o All new cars sold by Ford UK in Leeds this year (is this representative?)

Variable – refers to the quantity being measured. Examples:

o Age of a student; colour of eyes

o Size of car engine

o Ripeness of the fruit


Observation or data item – refers to the result of a measurement. Examples:


o Age of a student is “19”; colour of eyes is “green”

o Size of car engine is “2200ccm”

o The orange is “half-ripe”

Data can be given in different forms:


Qualitative data – are given as descriptions using names. Examples:
o “brown/blue”

o “yes/no”

o “ ripe/ half-ripe/not ripe/rotten”

Quantitative data – are given as numeric values. Quantitative data can be discrete or
continuous.
Quantitative discrete data are typically whole numbers. Examples:
o Number of absent students in a class

o Number of red cars sold by Ford in the UK

o Number of rotten oranges in a crate

Quantitative continuous data can take any value in a range. The data may be rounded
to distinct values but we still think of the data as being for a continuous variable. Examples:
o The height of students in a class

o The time taken for Ford to produce a car in the UK

o Size of a car engine

Exercise 1.1
For each of the following variables, state whether it is QUALITATIVE or QUANTITATIVE.
If it is quantitative, is it DISCRETE or CONTINUOUS?
Come up with your own answers before looking at the answers given below.

(i) Height of a person to the nearest centimetre

(ii) Height of a person, classed as short/medium/tall

(iii) Height of a person

(iv) Annual number of items sold of a product


(v) Soft-drink size, classed as small, medium or large

(vi) Earnings per share

(vii) Method of payment (cash, cheque or credit card)

(viii) Time to pay

(ix) Time to pay in days

(x) Ripeness of oranges

Answers: (i) Quantitative discrete (ii) Qualitative (iii) Quantitative continuous
(iv) Quantitative discrete (v) Qualitative (vi) Quantitative continuous (vii) Qualitative
(viii) Quantitative continuous (ix) Quantitative discrete (x) Qualitative

Statistics – is the science of data collection and data analysis.


Descriptive Statistics – is concerned with methods for summarising the data from a sample.
Inferential Statistics – is concerned with estimating properties of the population based on
data from a sample.
Note: The word ‘inference’ refers to a conclusion or opinion drawn from information, evidence
or reasoning.

Examples 1.2
o Descriptive statistics: A lecturer wants to summarise the performance of 70 Leeds
foundation year students on MATH0365 over the past two years.

o Inferential statistics: An auditor needs to determine whether the transactions on a
client’s balance sheet give an accurate representation of its financial circumstances.

o Inferential statistics: The government wants to assess the opinion of the British public
on whether or not we should join / leave the European Union.

The following video gives a summary of some of the concepts introduced above:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Notation.
With quantitative data we often use $x_i$ or $y_i$ to denote an observation, where $i$ takes the
values $1, 2, \ldots, n$, and $n$ is the number of observations in the sample.
Example 1.3
Suppose the variable of interest is the age of a student in years, and the population consists
of the students taking this module. If we take a sample of n = 5 observations, the data could
be
$$x_1 = 20,\quad x_2 = 19,\quad x_3 = 21,\quad x_4 = 18,\quad x_5 = 24.$$

We often need to order quantitative data by size. We denote by $x_{(1)}$ the smallest observation,
by $x_{(2)}$ the next smallest, and so on, and by $x_{(n)}$ the largest observation. (Note the
round brackets around the index!) For the example above we have

$$x_{(1)} = 18,\quad x_{(2)} = 19,\quad x_{(3)} = 20,\quad x_{(4)} = 21,\quad x_{(5)} = 24.$$

Often we need to add together observations for quantitative data. We use the notation

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n.$$

In the example above,

$$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 20 + 19 + 21 + 18 + 24 = 102.$$

Similarly, we can calculate

$$\sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + \ldots + x_n^2.$$

In the example,

$$\sum_{i=1}^{5} x_i^2 = 20^2 + 19^2 + 21^2 + 18^2 + 24^2 = 400 + 361 + 441 + 324 + 576 = 2102.$$
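The sums and the ordered observations above can also be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of the notes.

```python
# Sample from Example 1.3 (ages of five students).
x = [20, 19, 21, 18, 24]

total = sum(x)                               # x_1 + x_2 + ... + x_n
total_of_squares = sum(xi ** 2 for xi in x)  # x_1^2 + x_2^2 + ... + x_n^2
ordered = sorted(x)                          # ordered[k - 1] plays the role of x_(k)

print(total)             # 102
print(total_of_squares)  # 2102
print(ordered)           # [18, 19, 20, 21, 24]
```

Note that Python lists are indexed from 0, so the smallest observation $x_{(1)}$ is `ordered[0]`.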

Exercise 1.4
Let $x_1 = 2$, $x_2 = 8$, $x_3 = 4$, $x_4 = 12$. Calculate each of the following.

(i) $\sum_{i=1}^{3} x_i =$

(ii) $\sum_{i=1}^{4} x_i^2 =$

(iii) $x_{(3)} =$

Try the exercise yourself before looking at this video which explains the answers:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Stem-and-Leaf Diagrams.
Stem-and-leaf diagrams are used to display quantitative data in a more condensed form,
without losing the detail of the original data.
Example 1.5
The scores of 10 adults in a test are:

114, 99, 131, 124, 117, 102, 106, 127, 119, 114.

We use the last digit (the leaf) to represent each data item, with a stem consisting of the
previous digits. We always include a key on the diagram.

 9 | 9
10 | 2 6
11 | 4 4 7 9
12 | 4 7
13 | 1

Table 1.1: Stem-and-leaf diagram. Key: Stem|leaf = 9|9 means 99.

From the diagram we see that the lowest test score was $x_{(1)} = 99$ and the highest was
$x_{(10)} = 131$; 40% of the scores were in the range 110-119.
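As a rough sketch of how such a diagram could be produced by a program, the following Python code splits each score into a stem (all digits except the last) and a leaf (the last digit). The function name `stem_and_leaf` is made up for this illustration, and the code assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)

scores = [114, 99, 131, 124, 117, 102, 106, 127, 119, 114]
diagram = stem_and_leaf(scores)

# Print every stem between the smallest and largest, even if it has no leaves.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Printing stems that have no leaves keeps gaps in the data visible, which is part of what the diagram is meant to show.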

Example 1.6
The closing prices (to the nearest £) of 20 common stocks on a certain date were

30, 34, 43, 9, 38, 9, 8, 29, 35, 19, 9, 17, 38, 54, 17, 1, 48, 18, 9, 9.

Representing the above in a stem-and-leaf diagram we have

0 | 1 8 9 9 9 9 9
1 | 7 7 8 9
2 | 9
3 | 0 4 5 8 8
4 | 3 8
5 | 4

Table 1.2: Stem-and-leaf diagram. Key: Stem|leaf = 4|3 means £43.

We can see that the cheapest stock was £1 and the most expensive stock was £54. There
appear to be two different groupings of stock: those with prices in the range £1-£19, and
those with prices of £29 or more.
The following video discusses a few remarks on stem and leaf diagrams:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Histograms.
A stem-and-leaf diagram is useful if there are a small number of observations. For large
datasets histograms are more useful.
To construct a histogram we count the number of observations that lie within classes that we
choose. The classes span the range of the data and are non-overlapping. Bars are then drawn
which have areas proportional to the number of observations in the class.
IMPORTANT: It is the AREA of the bar that is proportional to the number of observations.
The number of observations in a class is called the class frequency.
Example 1.7
The data below refer to the amount spent on food (in £) in one week by 40 UK households.

44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81.

We are free to choose the widths of the classes, though remember they must be non-overlapping
and span the range of the data. In the above data the observations range from 18.60 to 95.90.
One possible approach, among many others, is to choose classes 17.495-27.495, 27.495-37.495,
. . . , 87.495-97.495.
Note: Here the class boundaries were chosen to end in the three digits .495 so that no data
point falls on the boundary between two classes. This is not strictly necessary, as long as we
are consistent and make clear to the reader how we group the data points into the classes.

Before we can construct the histogram we need to produce a frequency table.

Spending per week (£)   Tally            Frequency (number of observations)

17.495-27.495           |||               3
27.495-37.495           ||||              4
37.495-47.495           |||||             5
47.495-57.495           ||||| ||||| |    11
57.495-67.495           ||||| ||          7
67.495-77.495           |||||             5
77.495-87.495           ||||              4
87.495-97.495           |                 1
Total                                    40

Table 1.3: Frequency table for food shopping data.



Here is the histogram:

Figure 1.1: Histogram showing the amount spent on food (in £) in one week by 40 households.

The following video explains what is represented in the histogram, in particular on the vertical
axis:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Note that the width of each class is 10. The first bar, over the interval from 17.495 to 27.495,
should have an area of 3, since this is the number of observations (i.e. the frequency)
in this class. So the height of this bar is 0.3, since 0.3 × 10 = 3. The same applies to the
other intervals.
So the height of each bar is the class frequency divided by the class width. This quantity is
called the frequency density of the class, and it is the quantity recorded on the vertical
axis:

$$\text{Frequency density} = \frac{\text{Class frequency}}{\text{Class width}}$$
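To illustrate the calculation, here is a short Python sketch that counts the class frequencies for the food spending data of Example 1.7 and divides each by the class width. The function name `frequency_table` and the way the class edges are passed in are choices made for this sketch only.

```python
spending = [
    44.66, 18.60, 83.27, 59.90, 62.45, 50.96, 49.08, 60.08, 34.71, 53.75,
    79.22, 36.84, 30.45, 61.63, 55.80, 52.00, 48.92, 57.23, 40.50, 52.81,
    78.94, 63.89, 24.65, 74.56, 69.92, 20.21, 28.93, 65.04, 76.60, 73.12,
    65.55, 68.89, 50.15, 54.99, 87.31, 48.92, 40.81, 43.00, 95.90, 46.81,
]

def frequency_table(data, edges):
    """Return (lower, upper, frequency, frequency density) for each class."""
    table = []
    for lower, upper in zip(edges, edges[1:]):
        # No data point lies on a boundary here, so strict inequalities are safe.
        freq = sum(1 for v in data if lower < v < upper)
        table.append((lower, upper, freq, freq / (upper - lower)))
    return table

edges = [17.495 + 10 * k for k in range(9)]  # 17.495, 27.495, ..., 97.495
for lower, upper, freq, density in frequency_table(spending, edges):
    print(f"{lower:.3f}-{upper:.3f}  {freq:2d}  {density:.2f}")
```

The function divides by the actual width of each class, so it does not assume that all classes have the same width.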

Looking at a histogram, we can make a number of observations. For example:


The histogram shows that the smallest amount a family spent on food was around £17.5, and
the largest amount was around £97.5. The “average” amount spent on food is around £57.
Note that in the example above all of the classes have the same width, so the frequency
density is the frequency divided by 10 for each of the classes, i.e. in the example above the
frequency density is proportional to the frequency. This is not always the case, and becomes
important when not all the classes have the same width.
Let’s now look at the same data, but with one of the classes having a different width:
Using the same data on food shopping from above, suppose the classes 37.495-47.495 and
47.495-57.495 were merged to produce a single class 37.495-57.495. The frequency table would
appear as follows:

Spending per week (£)   Class width   Frequency   Frequency density

17.495-27.495           10             3          0.3
27.495-37.495           10             4          0.4
37.495-57.495           20            16          0.8
57.495-67.495           10             7          0.7
67.495-77.495           10             5          0.5
77.495-87.495           10             4          0.4
87.495-97.495           10             1          0.1
Total                                 40

Table 1.4: Frequency table for food shopping data.

And the histogram now looks like this:

Figure 1.2: Histogram showing amount spent on food (in £) in one week by 40 households.

Note how the area of the rectangle over the interval from 37.495 to 57.495 in Figure 1.2 is
the sum of the areas of the two rectangles over the intervals from 37.495 to 47.495 and 47.495
to 57.495 in Figure 1.1, as explained in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Exercise 1.8
Using some graph or other grid paper (if you have access to some), construct a histogram to
display the following data which refers to the amount of time (to the nearest minute) students
spent studying for a test. Before drawing the histogram, determine the frequency densities,
and decide on the scale for the histogram.

Time studied (in minutes)   Class width   Frequency   Frequency density

0.5-10.5                    10             5          0.5
10.5-15.5                    5            10
15.5-20.5                    5            12
20.5-30.5                   10             6
30.5-50.5                   20             2

Table 1.5: Frequency table for student study data.

Try to do this yourself first, before watching this video of the solution:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

The Mode.
The mode is defined to be the most frequently occurring value in a set of data.
Example 1.9
The data below gives the number of students present at a statistics tutorial over 10 consecutive
weeks.
8, 5, 7, 10, 8, 6, 5, 6, 4, 8.
Arranging in ascending numerical order we have

4, 5, 5, 6, 6, 7, 8, 8, 8, 10.

The mode of this set of data is 8. We also say that “the modal number of students present is
8”.
The mode need not be unique. If two values occur most frequently, the data are said to be
bimodal. If more than two values occur most frequently, the data are multimodal.
Note that the mode can be determined for qualitative and quantitative data, and is guaranteed
to be a value actually observed.
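A small Python sketch of this definition is given below. The helper `modes` is a name invented for this illustration; it returns every value that attains the highest frequency, so it also covers the bimodal and multimodal cases, and it works for qualitative data as well.

```python
from collections import Counter

def modes(data):
    """Return all values that occur most frequently, in sorted order."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

attendance = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
print(modes(attendance))                         # [8]
print(modes(["brown", "blue", "brown", "blue"]))  # ['blue', 'brown']
```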

The Median.
The median is defined as the “middle value” when the data are arranged in ascending
numerical order. More precisely, for a data set with $n$ observations,

if $n$ is odd, then the median is $x_{\left(\frac{n+1}{2}\right)}$,

if $n$ is even, then the median is $\frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right]$.

So, for an odd number of observations the median is the value “in the middle”; for an even
number of observations the median is the average of the two “middle values”. In this video
we look at these formulas a bit more closely:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.10
The data set
0, 7, 7, 19, 19, 20, 35,
has 7 data points. So the median is the 4th in this ordered list, i.e. the median is 19.
Example 1.11
Given the data from above with 10 data points,

4, 5, 5, 6, 6, 7, 8, 8, 8, 10,

the median is the average of the 5th and 6th observations, which is

$$\frac{6 + 7}{2} = \frac{13}{2} = 6.5,$$
i.e. the median number of students attending the tutorial is 6.5.

Note that the median can be a number which is not an actual observation.
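The two cases of the definition translate directly into code. The following Python function is a minimal sketch written for these notes (Python's standard library also provides `statistics.median`); the only subtlety is that list positions start at 0 rather than 1.

```python
def median(data):
    """Median as defined above: middle value, or average of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # x_((n+1)/2), shifted by one because Python counts from 0
        return ordered[(n + 1) // 2 - 1]
    # average of x_(n/2) and x_(n/2 + 1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([0, 7, 7, 19, 19, 20, 35]))         # 19
print(median([8, 5, 7, 10, 8, 6, 5, 6, 4, 8]))   # 6.5
```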

The Mean.
The mean, or sample mean, or average, is calculated by adding up all of the observations
and dividing by the number of observations. For observations $x_i$, where $i$ takes the
values $1, 2, 3, \ldots, n$, we write

$$\text{sample mean} = \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

We denote the sample mean as x̄ (x with a horizontal bar on top), pronounced “x bar”.

Example 1.12
The mean for the data from above is

$$\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8}{10} = \frac{67}{10} = 6.7.$$

The mean number of students attending the tutorial is 6.7 (again, not a number that we could
actually observe).

The mean is a very widely used measure, as it has many desirable statistical properties.

Example 1.13
Subscribers to Hello magazine are asked the following question in a survey: “How many of
the last four issues have you read or looked through?”
Suppose that the following frequency distribution summarises 500 responses.

Issues read $x_i$      0     1     2     3      4     Total

Frequency $f_i$       15    10    40    85    350     n = 500
$f_i x_i$              0    10    80   255   1400     1745

Table 1.6: Issues read in a survey on Hello magazine.

This video discusses how we should read this table, compute the mean and the median:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

And here is a written explanation:


How can we calculate the mean number of issues of Hello read by those in the survey? We
could write the data out as a list of 500 observations and calculate the mean from that.
However, it is much more convenient to use the following formula for the mean of frequency
data. Let $m$ be the number of classes in the table ($m = 5$ in the example), and let
$n = \sum_{i=1}^{m} f_i$. Then

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{m} f_i x_i = \frac{1}{500}\sum_{i=1}^{5} f_i x_i = \frac{1745}{500} = 3.49.$$
So the mean number of issues of Hello read by those in the survey is 3.49.
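The frequency formula can be sketched in Python as follows. The function name `frequency_mean` is chosen for this illustration; it takes the distinct values and their frequencies as two lists of equal length.

```python
def frequency_mean(values, freqs):
    """Mean of frequency data: (sum of f_i * x_i) divided by n = sum of f_i."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / n

issues = [0, 1, 2, 3, 4]
freqs = [15, 10, 40, 85, 350]
print(frequency_mean(issues, freqs))  # 3.49
```

This gives the same result as writing out all 500 observations and averaging them, but needs only the five rows of the table.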

Let’s now look at an example where the data are given as grouped data without the original
individual data values.
Example 1.14
Cars traveling on a road with a posted speed limit of 60 miles per hour are checked for speed
by a police radar system, giving the following frequency distribution of speeds.

Speed (miles per hour)   45-49   50-54   55-59   60-64   65-69   70-74   75-79   Total
Midpoint $x_i$              47      52      57      62      67      72      77
Frequency $f_i$             10      40     150     175      75      15      10   n = 475
$f_i x_i$                  470    2080    8550   10850    5025    1080     770   28825

Table 1.7: Speed when checked by police radar system.

To estimate the mean (it can only be an estimate, since we do not know the original data
values!) we use the midpoints of the classes as the $x_i$’s in the formula for the mean of
frequency data:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{m} f_i x_i = \frac{28825}{475} = 60.68.$$

We see that the mean speed of a car checked by the police was 60.68 miles per hour – just in
excess of the speed limit!
Remark 1.15
The mean may be significantly affected by the inclusion of a mistaken observation, as the
following example shows:
Example 1.16
Suppose in Example 1.9 the observation 10 is recorded as 100 by mistake. The data are now

4, 5, 5, 6, 6, 7, 8, 8, 8, 100.

The mode does not change; it is still 8. The median is unchanged; it is still
$\frac{6+7}{2} = 6.5$.
However, the mean is now:
$$\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{4 + 5 + 5 + 6 + 6 + 7 + 8 + 8 + 8 + 100}{10} = \frac{157}{10} = 15.7.$$

The mean has more than doubled.
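The comparison can be reproduced with Python's standard `statistics` module, as in the sketch below. The list names `original` and `mistaken` are chosen here for illustration.

```python
import statistics

original = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
# Replace the observation 10 by the mistaken value 100.
mistaken = [100 if v == 10 else v for v in original]

for label, data in [("original", original), ("mistaken", mistaken)]:
    print(label,
          statistics.mode(data),     # most frequent value
          statistics.median(data),   # middle value
          statistics.mean(data))     # arithmetic mean
# original 8 6.5 6.7
# mistaken 8 6.5 15.7
```

Only the mean reacts to the mistaken value; the mode and the median stay the same.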


Exercise 1.17
(We will go through this one in the lecture.)
The following data refer to the annual number of deaths from tornadoes in the USA between
the years 1990 and 2000.

53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40

Calculate the mean, the median and the mode of these data.

Quartiles.
The median is the value that splits the ordered data into two halves. Recall that we defined
it as follows:

o If $n$ is odd, then the median is $x_{\left(\frac{n+1}{2}\right)}$, and is an actual data point. In this
case we include the median in both the lower half and the upper half.

o If $n$ is even, then the median is $\frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right]$. In this case the lower half
consists simply of the lower $\frac{n}{2}$ values, and the upper half of the upper $\frac{n}{2}$ values.

The quartiles split the ordered data into quarters. We refer to the median as Q2 , and define
the lower quartile Q1 to be the median of the lower half, and the upper quartile Q3 to
be the median of the upper half.
!!! Be careful when reading textbooks, or using calculators or other software, to compute
quartiles. There are a number of different ways to define the lower and upper quartiles, and
they can produce different results !!!

Example 1.18
In a survey of 21 households, the number of telephones used by each household is given
below.

1, 3, 4, 1, 1, 2, 1, 1, 2, 5, 1, 2, 3, 0, 2, 1, 2, 1, 3, 0, 4.

Arranging the data in ascending numerical order gives

0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5,

where $x_{(6)} = 1$, $x_{(11)} = 2$ and $x_{(16)} = 3$.

There are $n = 21$ data items. Twenty-one is an odd number, and so the median number of
telephones per household is the data item $Q_2 = x_{\left(\frac{n+1}{2}\right)} = x_{(11)} = 2$. The lower quartile $Q_1$ is the
median of the data items $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(11)}$, which is $x_{(6)} = 1$. The upper quartile $Q_3$ is
the median of the data items $x_{(11)}, x_{(12)}, x_{(13)}, \ldots, x_{(21)}$, which is $x_{(16)} = 3$.
This video explains the example above in more detail:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.19
The data below gives the monthly starting salaries (in $) of 12 business school graduates.

2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880.
Since 12 is even, the median is $Q_2 = \frac{x_{(6)} + x_{(7)}}{2} = 2905$.

The lower quartile $Q_1$ is the median of $x_{(1)}, x_{(2)}, \ldots, x_{(6)}$, which is $Q_1 = \frac{x_{(3)} + x_{(4)}}{2} = 2865$.

The upper quartile $Q_3$ is the median of $x_{(7)}, x_{(8)}, \ldots, x_{(12)}$, which is $Q_3 = \frac{x_{(9)} + x_{(10)}}{2} = 3000$.
See whether you can get the same answers. This video goes through the process of determining
Q1 , Q2 , and Q3 for this example:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

We now introduce a number of concepts which measure the dispersion, i.e. the spread
or variability, of a quantitative data set.

The Range.
The range is simply the difference between the largest and the smallest observation, i.e. in
a data set with n observations

$$\text{range} = (\text{largest observation}) - (\text{smallest observation}) = x_{(n)} - x_{(1)}.$$


In Example 1.19, the range is $9130 − $2710 = $6420. Is this a reasonable representation of
the variability? – probably not, as eleven of the twelve salaries lie between $2710 and $3325,
with one data point of $9130 being much larger than all the others.

The Interquartile Range.


The range can be inflated by the presence of a single very large or very small value. It is
better to give the interquartile range (IQR) which is the range of the middle 50% of the data
values, i.e. the difference between the upper and the lower quartile; this is

IQR = Q3 − Q1 .

For Example 1.19, the interquartile range IQR is $3000 − $2865 = $135.
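The quartiles, the range and the interquartile range for Example 1.19 can be reproduced with the Python sketch below. It follows the definition of the quartiles used in these notes (for odd $n$ the median is included in both halves); the function names are invented for this illustration, and other software may use a different definition and give different values.

```python
def median(ordered):
    """Median of an already sorted list, as defined in these notes."""
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3): Q1 and Q3 are the medians of the lower and upper half."""
    ordered = sorted(data)
    n = len(ordered)
    half = (n + 1) // 2  # for odd n, both halves include the median itself
    return median(ordered[:half]), median(ordered), median(ordered[n - half:])

salaries = [2850, 2950, 3050, 2880, 2755, 2710, 2890, 9130, 2940, 3325, 2920, 2880]
q1, q2, q3 = quartiles(salaries)
print(q1, q2, q3)                      # 2865.0 2905.0 3000.0
print(max(salaries) - min(salaries))   # range: 6420
print(q3 - q1)                         # interquartile range: 135.0
```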

The Variance and Standard Deviation.


The (sample) variance and (sample) standard deviation are the most widely used measures
of variation or dispersion of a data set, as they have desirable statistical properties.
The sample variance measures how far the data points are spread out from the mean, and
for a set with $n$ data points $x_1, \ldots, x_n$, is defined as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$

(The reason why the sample variance is denoted by a square, $s^2$, will become clear further below.)
If s2 is small, the data items lie fairly close to the mean.
If s2 is large, the data items are widely spread about the mean.
The formula above is discussed in more detail in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.20
The data given in Example 1.9 with the number of students present at a statistics tutorial
over 10 consecutive weeks were

8, 5, 7, 10, 8, 6, 5, 6, 4, 8,

with a mean of 6.7.


So the sample variance for these data is

$$s^2 = \frac{1}{9}\left[(8 - 6.7)^2 + (5 - 6.7)^2 + (7 - 6.7)^2 + (10 - 6.7)^2 + (8 - 6.7)^2 + (6 - 6.7)^2 + (5 - 6.7)^2 + (6 - 6.7)^2 + (4 - 6.7)^2 + (8 - 6.7)^2\right] \approx 3.3.$$

For larger data sets, this involves a lot of calculations. Here is a more “calculator friendly”
form of the sample variance:

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right].$$

One can show that the two formulas produce the same results. We will show this for n = 3
in this video:
(Open the link to the video in a new tab or new window - usually done by using right-click.)
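For readers who prefer a written argument, here is a sketch of why the two forms of the sample variance agree for any $n$. It uses only the definition of the mean, $\sum_{i=1}^{n} x_i = n\bar{x}$.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2
     && \text{since } \sum_{i=1}^{n} x_i = n\bar{x} \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2 .
\end{align*}
```

Dividing both sides by $n - 1$ turns the left-hand side into the defining formula for $s^2$ and the right-hand side into the “calculator friendly” form.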

So in the example from above, we could compute the sample variance instead as follows:

$$\sum_{i=1}^{10} x_i^2 = 8^2 + 5^2 + 7^2 + 10^2 + 8^2 + 6^2 + 5^2 + 6^2 + 4^2 + 8^2 = 479,$$

and

$$\sum_{i=1}^{10} x_i = 8 + 5 + 7 + 10 + 8 + 6 + 5 + 6 + 4 + 8 = 67,$$

so

$$s^2 = \frac{1}{9}\left(479 - \frac{67^2}{10}\right) = \frac{1}{9}(479 - 448.9) \approx 3.3.$$

The sample standard deviation is defined as

$$s = \sqrt{s^2}.$$

It is measured in the same units as the original data and is therefore easier to interpret.
For the example from above, $s = \sqrt{s^2} = \sqrt{3.3} = 1.8$; so the standard deviation is 1.8 students.
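Both forms of the sample variance, and the standard deviation, can be checked numerically. The Python functions below are a sketch written for this purpose; their names are not taken from any library.

```python
import math

def sample_variance(data):
    """Defining formula: squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_variance_shortcut(data):
    """'Calculator friendly' form using the sum and the sum of squares."""
    n = len(data)
    return (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

attendance = [8, 5, 7, 10, 8, 6, 5, 6, 4, 8]
s2 = sample_variance(attendance)
print(round(s2, 2))                                    # 3.34
print(round(sample_variance_shortcut(attendance), 2))  # 3.34
print(round(math.sqrt(s2), 2))                         # 1.83
```

To one decimal place these are the values 3.3 and 1.8 quoted in the text.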

Example 1.21
Let us return to the survey on Hello subscribers from earlier in the chapter.

Number read $x_i$      0     1     2     3      4     Total

Frequency $f_i$       15    10    40    85    350     n = 500
$f_i x_i$              0    10    80   255   1400     1745
$f_i x_i^2$            0    10   160   765   5600     6535

Table 1.8: Issues read in a survey on Hello magazine.

Here is how we calculate the sample variance s2 for these data, using the frequencies fi .
We have $n = \sum_{i=1}^{m} f_i = 500$. So

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{m} f_i x_i\right)^2\right] = \frac{1}{499}\left(6535 - \frac{1745^2}{500}\right) = 0.89.$$

In this video we briefly explain how to apply the formula for the given frequency table:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.22
We now return to the police speed check example. See whether you can follow how the
formula is applied below.

Speed (miles per hour)   45-49    50-54    55-59    60-64    65-69   70-74   75-79   Total
Midpoint $x_i$              47       52       57       62       67      72      77
Frequency $f_i$             10       40      150      175       75      15      10   n = 475
$f_i x_i$                  470     2080     8550    10850     5025    1080     770   28825
$f_i x_i^2$              22090   108160   487350   672700   336675   77760   59290   1764025

Table 1.9: Speed when checked by police radar system.

Here

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{m} f_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{m} f_i x_i\right)^2\right] = \frac{1}{474}\left(1764025 - \frac{28825^2}{475}\right) = 31.23 \text{ mph}^2.$$
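The frequency form of the sample variance can be sketched in Python as follows. The function name `frequency_variance` is an assumption of this illustration; for grouped data the class midpoints are passed in as the values, so the result is only an estimate.

```python
def frequency_variance(values, freqs):
    """Sample variance from a frequency table, using the 'calculator friendly' form."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

# Hello magazine survey
print(round(frequency_variance([0, 1, 2, 3, 4], [15, 10, 40, 85, 350]), 2))  # 0.89

# Police speed check, using class midpoints
midpoints = [47, 52, 57, 62, 67, 72, 77]
speed_freqs = [10, 40, 150, 175, 75, 15, 10]
print(round(frequency_variance(midpoints, speed_freqs), 2))  # 31.23
```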

Exercise 1.23
(We will go through this one in the lecture.)
The data below refer to the number of brothers and sisters a sample of students have. You
will need to calculate the values that go in place of the constants B1 , B2 , B3 and B4 .

Number of brothers/sisters, $x_i$    0    1    2    3     4    5     Total

Frequency, $f_i$                     5   12    8    3     0    1     29
$f_i x_i$                            0   12   16    B1    0    5     B2
$f_i x_i^2$                          0   12   32    27    0    B3    B4

Table 1.10: Data on numbers of brothers and sisters.

Also calculate the mean, the median, the sample variance, and the standard deviation for
these data.

Skewness and Outliers.


If a data set is approximately symmetric, the median Q2 is roughly equally spaced between
the upper quartile Q3 and the lower quartile Q1 . If the median is much closer to Q1 than to
Q3 , then the data set is positively skewed (containing some rather high values). If the median
is much closer to Q3 than to Q1 , then the data set is negatively skewed (containing some
rather low values).
The quartile coefficient of skewness is given by

$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}.$$

Note that this is the same as $\dfrac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$.
This coefficient takes values between −1 and +1. A value near zero indicates symmetry, a
negative value indicates that the data is negatively skewed, a positive value indicates that the
data is positively skewed. The following video explains how the above formula represents the
skewness of the data:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Box Plots.
A useful way of comparing two data sets is to produce box plots. To construct a box plot you
must first calculate the median $Q_2$ and the quartiles $Q_1$ and $Q_3$. The general form of a box
plot is shown below:

[Box plot with labels: Smallest observation, Q1, Median, Q3, Largest observation]

Figure 1.3: A box plot.

This video gives a brief introduction to box plots:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

Example 1.24
The following data give the systolic blood pressure of 12 smokers and 12 non-smokers.

Smokers       122  146  120  114  124  126  118  128  130  134  116  130
Non-smokers   114  134  114  116  138  110  112  116  132  126  108  116

Table 1.11: Blood pressures of smokers and non-smokers.

For the smokers the median Q2 = 125, the lower quartile Q1 = 119, and the upper quartile
Q3 = 130.
For the non-smokers, the median Q2 = 116, the lower quartile Q1 = 113, and the upper
quartile Q3 = 129.

The two box plots are as follows.

[Box plots for smokers and non-smokers; horizontal axis: blood pressure, from 100 to 150]

Figure 1.4: Box plots comparing blood pressure in smokers and non-smokers.

We see that the blood pressures of the non-smokers tend to be lower than those of the
smokers (the box in the non-smokers’ plot is shifted to the left compared to the smokers’ plot).
However, blood pressure amongst non-smokers appears to be more variable than amongst
smokers (the box in the non-smokers’ plot is wider than the box in the smokers’ plot).

Skewness for the smokers:


$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{(130 - 125) - (125 - 119)}{130 - 119} = -0.09.$$
The blood pressures of the smokers are slightly negatively skewed; Q1 is a bit further away
from Q2 than Q3 is.

Skewness for the non-smokers:


$$\frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{(129 - 116) - (116 - 113)}{129 - 113} = \frac{10}{16} = 0.625.$$

The blood pressures of the non-smokers are positively skewed; $Q_3$ is further away from $Q_2$
than $Q_1$ is (i.e. there are some non-smokers with rather high blood pressures compared to
the rest).

An outlier is an extremely high or extremely low observation. An outlier may be a data item
that has been incorrectly recorded (in which case it should be removed from the data set), or
it may be a genuine observation (but unusual in some way). An observation is identified as
an outlier if it is less than

$$Q_1 - \frac{3}{2}(Q_3 - Q_1),$$

or greater than

$$Q_3 + \frac{3}{2}(Q_3 - Q_1).$$

Outliers for the smokers (for the data from above):

$$Q_1 - \frac{3}{2}(Q_3 - Q_1) = 119 - \frac{3}{2}(130 - 119) = 102.5, \quad\text{and}\quad Q_3 + \frac{3}{2}(Q_3 - Q_1) = 130 + \frac{3}{2}(130 - 119) = 146.5.$$

Hence there are no outliers in the smokers’ data.

Outliers for the non-smokers:

$$Q_1 - \frac{3}{2}(Q_3 - Q_1) = 113 - \frac{3}{2}(129 - 113) = 89, \quad\text{and}\quad Q_3 + \frac{3}{2}(Q_3 - Q_1) = 129 + \frac{3}{2}(129 - 113) = 153.$$
There are also no outliers in the non-smokers’ data.
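The skewness coefficient and the outlier rule for Example 1.24 can be checked with the Python sketch below. The quartiles are taken as given in the text; the function names are invented for this illustration.

```python
def quartile_skewness(q1, q2, q3):
    """Quartile coefficient of skewness: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1)."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

def outlier_limits(q1, q3):
    """Observations below the first or above the second limit count as outliers."""
    spread = 1.5 * (q3 - q1)
    return q1 - spread, q3 + spread

def outliers(data, q1, q3):
    low, high = outlier_limits(q1, q3)
    return [v for v in data if v < low or v > high]

smokers = [122, 146, 120, 114, 124, 126, 118, 128, 130, 134, 116, 130]
non_smokers = [114, 134, 114, 116, 138, 110, 112, 116, 132, 126, 108, 116]

print(round(quartile_skewness(119, 125, 130), 2))  # -0.09
print(quartile_skewness(113, 116, 129))            # 0.625
print(outlier_limits(119, 130))                    # (102.5, 146.5)
print(outlier_limits(113, 129))                    # (89.0, 153.0)
print(outliers(smokers, 119, 130), outliers(non_smokers, 113, 129))  # [] []
```

The largest smoker reading, 146, lies just inside the upper limit of 146.5, which is why it is not flagged.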

This video explains the concept of an outlier using box plots:


(Open the link to the video in a new tab or new window - usually done by using right-click.)

Exercise 1.25
(We will go through this one in the lecture.)
Here are the case prices (in $) for 13 wines produced in the USA.

52, 66, 70, 80, 95, 100, 110, 112, 115, 118, 123, 143, 151.

(i) Calculate the Quartile Coefficient of Skewness.

(ii) Identify any outliers.

(iii) Construct a box plot to display the data, using graph or grid paper (if available).
CHAPTER 2

Correlation and Regression

When looking at data we are often interested in whether there is a relationship or dependence
between two or more variables. For example, you could be looking for evidence that the height
of the parents and the height of the children are related, or that there is some connection
between the age of people and a certain illness.
We will focus on the relationship between two variables, which we will typically call x and y.
If one of the two variables can be identified as potentially affecting the other, we would name
that variable x. This is not always the case though.

Scatter Diagrams.
When dealing with two (or more) variables it is always useful to produce a scatter plot. The
variable x is plotted on the horizontal axis and the variable y is plotted on the vertical axis.
Example 2.1
The variable x is the number of TV commercials for a product shown during a week, and the
variable y is the volume of sales (in £100s) during the following week:

Number of commercials, $x_i$    2    5    1    3    4    1    5    3    4    2
Sales (£100s), $y_i$           50   57   41   54   54   38   63   48   59   46

Table 2.1: Data on commercials shown and sales.

[Scatter plot: Sales (vertical axis, 30 to 70) against Number of commercials shown (horizontal axis, 1 to 5)]

Figure 2.1: Scatter diagram of sales against number of commercials shown.

We can see from the scatter plot that as the number of commercials increases, the sales increase
as well. We say that the number of commercials and the sales seem to be “correlated”. The
correlation coefficient, which we define in the next section, is a numerical measure of how
strong this “correlation” is.

Correlation.
The correlation coefficient r is a number between −1 and +1 that quantifies how closely two
variables x and y follow a linear relation, and whether this relationship is positive or negative.
It is defined as
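$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

where

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$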

and n is the number of observations of the variables x and y.
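As an illustration of the definition, the following Python sketch computes $S_{xy}$, $S_{xx}$, $S_{yy}$ and $r$ for the commercials data of Example 2.1. The function name `correlation` is chosen for this sketch; the value of $r$ it prints is computed here and is not quoted from the notes.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

commercials = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
print(round(correlation(commercials, sales), 2))  # 0.93
```

For these data $S_{xy} = 99$, $S_{xx} = 20$ and $S_{yy} = 566$, so $r = 99/\sqrt{11320} \approx 0.93$, a strong positive correlation in line with the scatter diagram.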


This video explains what is what in these formulas:
(Open the link to the video in a new tab or new window - usually done by using right-click.)

o If $r > 0$ then there is a positive linear relationship between the $x_i$ and the $y_i$, meaning
that as the $x_i$ get larger, the $y_i$ tend to get larger as well at a (more or less) constant
rate.

o If $r < 0$ then there is a negative linear relationship between the $x_i$ and the $y_i$, meaning
that as the $x_i$ get larger, the $y_i$ tend to decrease at a (more or less) constant rate.
The closer the correlation coefficient is to +1 or −1, respectively, the stronger/more
perfect this linear relationship is.

o A correlation coefficient of +1 means there is a perfect positive linear relationship
between the variables x and y. As x increases, y increases at a constant rate.
An example of this is when $y_i = c x_i$ for all $i$, for some positive constant $c$.

Figure 2.2: The points lie on a straight line with a positive gradient.

o A correlation coefficient of −1 means that there is a perfect negative linear relationship
between the variables x and y. That is to say that as x increases, y decreases at a
constant rate.
An example of this is when yi = cxi for all i, for some negative constant c.

Figure 2.3: The points lie on a straight line with a negative gradient.


o A correlation coefficient of 0 means there is no linear relationship between the variables
x and y. The scatter plot in this case could, for example, look like this.

Figure 2.4: Zero correlation.

However, these cases where the correlation coefficient r is equal to 1, −1, or 0 are exceptional.
For real data, the correlation coefficient is almost never exactly equal to 1, −1, or 0; it
typically lies somewhere in between.
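
The special cases r = +1 and r = −1 can be checked directly from the definition. The short derivation below is a sketch added for illustration; it assumes that yi = c xi for all i with c ≠ 0, and that the xi are not all equal (so that Sxx > 0). Since then ȳ = c x̄, we have yi − ȳ = c (xi − x̄), and therefore

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})\,c\,(x_i - \bar{x}) = c\,S_{xx}, \qquad S_{yy} = \sum_{i=1}^{n} c^2 (x_i - \bar{x})^2 = c^2 S_{xx},$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{c\,S_{xx}}{\sqrt{S_{xx} \cdot c^2 S_{xx}}} = \frac{c}{|c|} = \begin{cases} +1 & \text{if } c > 0, \\ -1 & \text{if } c < 0. \end{cases}$$

The same argument works for yi = a + c xi, because the constant a cancels in yi − ȳ.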

To understand a bit better why the formula for the correlation coefficient has the form it has,
consider the following example:
Let x1 , . . . , xn be the individual marks of students in a mock exam, and let y1 , . . . , yn be
the individual marks of those students in the final exam. We expect that there is a strong
positive relationship/correlation between the variable x and y, i.e. that a high mark in the
mock exam will most likely result in a high mark in the final exam, and similarly for low
marks. So we anticipate that a positive value for xi − x̄ (high mark in the mock exam) goes
together with a positive value for yi − ȳ (high mark in the final exam), and that a negative
value for xi − x̄ (low mark in the mock exam) goes together with a negative value for yi − ȳ
(low mark in the final exam). In both cases this results in a positive value for (xi − x̄)(yi − ȳ)
in the numerator. The denominator simply scales the quantity to values between −1 and 1.
Hence, correlation coefficients near −1 or +1 are indicative of a strong linear relationship
between the variables x and y, and correlation coefficients near zero indicate that there is no
linear relationship, or only a very weak one, between the variables. Values outside the range
−1 to 1 indicate that a mistake has occurred in the calculation.
It is important to note that the correlation coefficient does not tell us anything about
the slope or gradient of the linear relationship, if there is one.

Fact 2.2
Notice the similarity of the formulas for Sxx and Syy to the formulas for the sample variance.
As with the formula for the variance, there are more calculator-friendly versions of the formulas
for Sxy , Sxx , and Syy . These are:

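$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right),$$
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2,$$
$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2.$$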
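
To see why the calculator-friendly version of Sxy agrees with the definition, one can expand the product and use the identities Σ xi = n x̄ and Σ yi = n ȳ. The following is a brief sketch of that step, added here for illustration:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i - \bar{x} \sum_{i=1}^{n} y_i + n \bar{x} \bar{y} = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right).$$

Replacing every yi by xi gives the formula for Sxx, and replacing every xi by yi gives the formula for Syy.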

Let’s now look at some more realistic examples:


Exercise 2.3
The data below refers to the horsepower and miles per gallon (MPG) for several cars produced
in 2001. Draw a scatter diagram of these data, make a prediction about the correlation
coefficient, and then compute the correlation coefficient.

Car Horsepower xi MPG yi


Audi A4 170 22
Honda Civic 127 29
Lexus 300 215 21
Mazda MPV 170 18
Toyota Camry 194 21
VW Beetle 115 29

Table 2.2: Car horsepower and miles per gallon data.

Try this exercise yourself before looking at the answer in the accompanying video.

Example 2.4
Consider the following data which refers to the amount of fat (in grams) and the number
of calories in a sample of fast food beef burgers. Draw a scatter diagram of these data, and
compute the correlation coefficient using the formulas above.

Fat (grams), xi 19 31 34 35 39 39 43
Calories, yi 410 580 590 570 640 680 660

Table 2.3: Beef burger data.

We will look at this example in the lecture, but you could try it yourself beforehand.

A Misuse of Correlation.

Consider the following data.

xi −1 −0.75 −0.50 −0.25 0 0.25 0.50 0.75 1
yi 0 0.66 0.87 0.97 1 0.97 0.87 0.66 0

Table 2.4: Example data.


[Scatter plot: y (vertical axis, 0 to 1) against x (horizontal axis, −1 to 1), with a dashed curve through the points.]

Figure 2.5: Scatter diagram.

There appears to be a relationship between the variables, as indicated by the dashed
curve. However, as the relationship is not linear, the correlation coefficient cannot be used to
quantify it. Here

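$$\sum_{i=1}^{9} x_i = 0, \quad \sum_{i=1}^{9} x_i^2 = 3.75, \quad n = 9, \quad \sum_{i=1}^{9} y_i = 6, \quad \sum_{i=1}^{9} y_i^2 = 5.27, \quad \sum_{i=1}^{9} x_i y_i = 0.$$

$$S_{xx} = \sum_{i=1}^{9} x_i^2 - \frac{1}{9} \left( \sum_{i=1}^{9} x_i \right)^2 = 3.75 - \frac{0^2}{9} = 3.75,$$
$$S_{yy} = \sum_{i=1}^{9} y_i^2 - \frac{1}{9} \left( \sum_{i=1}^{9} y_i \right)^2 = 5.27 - \frac{6^2}{9} = 1.27,$$
$$S_{xy} = \sum_{i=1}^{9} x_i y_i - \frac{1}{9} \left( \sum_{i=1}^{9} x_i \right) \left( \sum_{i=1}^{9} y_i \right) = 0 - \frac{0 \times 6}{9} = 0.$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{0}{\sqrt{3.75 \times 1.27}} = 0.$$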
Here r = 0, indicating that a linear relationship is not appropriate. Without plotting the
data we might have mistakenly concluded that there was no relationship between the two
variables.
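
The calculation for Table 2.4 can be reproduced with a few lines of Python. This is an illustrative sketch only; the variable names are made up here, and the remark that the y-values look like rounded values of sqrt(1 − x²) is an observation added in this sketch, not a statement from the notes.

```python
from math import sqrt

# Data from Table 2.4
xs = [-1, -0.75, -0.50, -0.25, 0, 0.25, 0.50, 0.75, 1]
ys = [0, 0.66, 0.87, 0.97, 1, 0.97, 0.87, 0.66, 0]

n = len(xs)
# Calculator-friendly forms of Sxx, Syy and Sxy
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
r = s_xy / sqrt(s_xx * s_yy)

# Assumption made in this sketch: the y-values are roughly sqrt(1 - x^2),
# i.e. the points lie close to a semicircle.
max_gap = max(abs(y - sqrt(1 - x * x)) for x, y in zip(xs, ys))

print("r is zero up to rounding error:", abs(r) < 1e-9)
print("largest gap to sqrt(1 - x^2):", round(max_gap, 3))
```

So the points follow a curve almost exactly, yet r is zero: the correlation coefficient only measures linear association.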

Another Misuse of Correlation.


There are other difficulties in using a correlation coefficient to quantify a relationship, without
first referring to suitable plots of the data. Consider the following example in which a value
of 1 has been mistakenly recorded as 100.

xi 1 2 3 4 5
yi 5 4 3 2 100

Table 2.5: A data set with an error.


[Scatter plot: y (vertical axis, 0 to 100) against x (horizontal axis, 1 to 5).]

Figure 2.6: Scatter diagram.

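$$\sum_{i=1}^{5} x_i = 15, \quad \sum_{i=1}^{5} x_i^2 = 55, \quad n = 5, \quad \sum_{i=1}^{5} y_i = 114, \quad \sum_{i=1}^{5} y_i^2 = 10054, \quad \sum_{i=1}^{5} x_i y_i = 530.$$

$$S_{xx} = \sum_{i=1}^{5} x_i^2 - \frac{1}{5} \left( \sum_{i=1}^{5} x_i \right)^2 = 55 - \frac{15^2}{5} = 10,$$
$$S_{yy} = \sum_{i=1}^{5} y_i^2 - \frac{1}{5} \left( \sum_{i=1}^{5} y_i \right)^2 = 10054 - \frac{114^2}{5} = 7454.8,$$
$$S_{xy} = \sum_{i=1}^{5} x_i y_i - \frac{1}{5} \left( \sum_{i=1}^{5} x_i \right) \left( \sum_{i=1}^{5} y_i \right) = 530 - \frac{15 \times 114}{5} = 188.$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{188}{\sqrt{10 \times 7454.8}} = 0.69.$$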
A correlation coefficient of 0.69 would indicate that there is a positive linear relationship
between the variables x and y. However, if the value of 100 had been correctly recorded
as 1 then we would see that there is a perfect negative linear relationship between the
variables, and the correlation coefficient would take the value −1. It is easy to make a mistake
in recording data, or entering it into a computer. Failure to spot such a mistake can lead to
incorrect conclusions!
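
The effect of the single mis-recorded value can be seen by computing r for both versions of the data. The Python sketch below is an illustration added to the notes; the helper `correlation` and the list names are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """r = Sxy / sqrt(Sxx * Syy), using deviations from the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys_recorded = [5, 4, 3, 2, 100]  # Table 2.5, including the recording error
ys_corrected = [5, 4, 3, 2, 1]   # the last value as it should have been recorded

r_recorded = correlation(xs, ys_recorded)
r_corrected = correlation(xs, ys_corrected)
print(round(r_recorded, 2))   # 0.69
print(round(r_corrected, 2))  # -1.0
```

One wrong entry out of five is enough to turn a perfect negative correlation into a moderately strong positive one.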

IMPORTANT! Correlation Does Not Imply Causation.


Two variables being correlated does not imply that changes in one variable cause changes in
the other. Consider the example below.
[Scatter plot: price of rum in Havana (vertical axis) against salaries of Presbyterian ministers (horizontal axis).]

Figure 2.7: Scatter diagram of rum prices in Havana against salaries of ministers in Massachusetts.

There is a strong positive correlation between the salaries of ministers in Massachusetts
and the price of rum in Havana. Which is the cause and which the effect? In other words, are
the ministers benefiting from the rum trade or supporting it? Both conclusions are silly. It is
easy to see that both figures are growing because of the influence of a third factor: the historic
and world-wide rise in the price level of most commodities. We should always be aware that
there may be a third, unrecorded, variable that is creating the correlation between two
recorded variables.

If you are interested in seeing other examples of spurious correlations, have a look at this web
site: https://www.tylervigen.com/spurious-correlations.

Linear Regression

Linear regression is an important and widely used technique to model relationships between
variables. It is used for various purposes, for example to explain and quantify a relationship,
but also to make predictions.
Suppose the variable x can affect the variable y and we want to model the relationship in
terms of fitting the most appropriate straight line to the data. We use the method of least
squares regression.
We fit the straight line relationship y = a + bx to the data. In general it is not possible to
find a line that fits the data exactly. At each point xi there will be an error (residual) in
our estimated value of yi which we denote as ri = yi − (a + bxi ). One sensible approach is
to choose the straight line with equation y = a + bx that minimises the sum of the squared
residuals, i.e. makes $\sum_{i=1}^{n} r_i^2$ as small as possible.

[Plot of y against x showing the line y = a + bx and the residuals r1, ..., r5.]

Figure 2.8: Illustrating regression.

Mathematically this happens when

$$b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x},$$

with Sxy and Sxx defined previously.
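
For readers who want to see where these formulas come from, here is a brief sketch of the standard calculus argument. It is added for illustration; the symbol Q for the sum of squared residuals is introduced here and is not used elsewhere in the notes. Write

$$Q(a, b) = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2.$$

At the minimum both partial derivatives vanish:

$$\frac{\partial Q}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0, \qquad \frac{\partial Q}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0.$$

Dividing the first equation by n gives a = ȳ − b x̄. Substituting this into the second equation gives

$$\sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - b (x_i - \bar{x}) \right] = 0.$$

Because the bracketed terms sum to zero, x̄ times their sum can be subtracted without changing anything, which yields

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) - b \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0, \qquad \text{that is,} \qquad b = \frac{S_{xy}}{S_{xx}}.$$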



Linear regression has many practical uses. One is to make predictions. The regression line
provides a model which allows us to plug in values for x for which we do not have accompanying
values for y, and so we can make predictions for the values of y.
Example 2.5
Suppose that we are interested in the relationship between the number of staff at an insurance
firm and the number of policies the firm issues (per month). It seems sensible to suggest that
firms with more staff will issue a larger number of insurance policies. How can we verify and
quantify this relationship?
The following data refers to 25 independent insurance firms.
Number of staff, xi 3 6 5 8 16 15 23 5 13
Number of policies, yi 44 144 150 236 739 970 2371 309 679
Number of staff, xi 4 19 33 19 10 16 22 2 3
Number of policies, yi 26 1272 3246 1904 357 1080 1027 45 62
Number of staff, xi 2 22 2 18 21 24 9
Number of policies, yi 68 2507 138 502 1501 2750 192

Table 2.6: Data on a sample of 25 insurance brokers.

The first step when looking for a relationship between two variables is to plot them against
each other.
[Scatter plot: number of policies (vertical axis) against number of staff (horizontal axis).]

Figure 2.9: Scatter diagram of number of policies against number of staff for 25 insurance
firms.

It seems that there is indeed a relationship between the two variables; firms with a larger
number of staff do issue more insurance policies! The plot of the data confirms that it
is appropriate to fit a straight line relationship between the variables. We begin by
calculating the correlation coefficient, r.

x_i y_i x_i^2 y_i^2 x_i y_i
3 44 9 1936 132
6 144 36 20736 864
5 150 25 22500 750
8 236 64 55696 1888
16 739 256 546121 11824
15 970 225 940900 14550
23 2371 529 5621641 54533
5 309 25 95481 1545
13 679 169 461041 8827
4 26 16 676 104
19 1272 361 1617984 24168
33 3246 1089 10536516 107118
19 1904 361 3625216 36176
10 357 100 127449 3570
16 1080 256 1166400 17280
22 1027 484 1054729 22594
2 45 4 2025 90
3 62 9 3844 186
2 68 4 4624 136
22 2507 484 6285049 55154
2 138 4 19044 276
18 502 324 252004 9036
21 1501 441 2253001 31521
24 2750 576 7562500 66000
9 192 81 36864 1728
Totals: 320 22319 5932 42313977 470050

Table 2.7: Intermediate calculations.

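$$\sum_{i=1}^{25} x_i = 320, \quad \sum_{i=1}^{25} x_i^2 = 5932, \quad n = 25, \quad \bar{x} = 12.8.$$
$$\sum_{i=1}^{25} y_i = 22319, \quad \sum_{i=1}^{25} y_i^2 = 42313977, \quad \sum_{i=1}^{25} x_i y_i = 470050, \quad \bar{y} = 892.76.$$

$$S_{xx} = \sum_{i=1}^{25} x_i^2 - \frac{1}{25} \left( \sum_{i=1}^{25} x_i \right)^2 = 5932 - \frac{320^2}{25} = 1836,$$
$$S_{yy} = \sum_{i=1}^{25} y_i^2 - \frac{1}{25} \left( \sum_{i=1}^{25} y_i \right)^2 = 42313977 - \frac{22319^2}{25} = 22388467,$$
$$S_{xy} = \sum_{i=1}^{25} x_i y_i - \frac{1}{25} \left( \sum_{i=1}^{25} x_i \right) \left( \sum_{i=1}^{25} y_i \right) = 470050 - \frac{320 \times 22319}{25} = 184366.8.$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{184366.8}{\sqrt{1836 \times 22388467}} = 0.91.$$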

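We want to use the number x of staff a firm has to predict the number y of policies it
issues. We will now fit the regression line y = a + bx. Here
$$b = \frac{S_{xy}}{S_{xx}} = \frac{184366.8}{1836} = 100.42,$$
$$a = \bar{y} - b\bar{x} = 892.76 - 100.4176 \times 12.8 = -392.59,$$
where the unrounded value b = 100.4176 is used when computing a (using the rounded value
100.42 would give −392.62 instead).
We have the regression line y = −392.59 + 100.42x.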
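
The whole calculation can also be carried out in a few lines of Python. The sketch below is an illustration added to the notes, using the data of Table 2.6; the function name `fit_line` and the list names are hypothetical.

```python
# Data from Table 2.6 (25 insurance firms)
staff = [3, 6, 5, 8, 16, 15, 23, 5, 13,
         4, 19, 33, 19, 10, 16, 22, 2, 3,
         2, 22, 2, 18, 21, 24, 9]
policies = [44, 144, 150, 236, 739, 970, 2371, 309, 679,
            26, 1272, 3246, 1904, 357, 1080, 1027, 45, 62,
            68, 2507, 138, 502, 1501, 2750, 192]

def fit_line(xs, ys):
    """Least squares line y = a + b x, with b = Sxy / Sxx and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Calculator-friendly forms of Sxy and Sxx
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

a, b = fit_line(staff, policies)
print(f"b = {b:.2f}, a = {a:.2f}")  # b = 100.42, a = -392.59
```

Working with the unrounded slope throughout, as the code does, avoids small rounding discrepancies in the intercept.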

We interpret the parameter b (the gradient of the regression line) to mean that, for every
extra member of staff, a firm issues 100.42 extra insurance policies per month (or, phrased in
a more sensible manner, each extra member of staff issues approximately 100 extra policies).
Remember how the gradient represents the change in y as we increase x by 1.

The value of the parameter a here does not have a sensible physical interpretation. What
it appears to suggest is that a firm with no staff (!) issues −393 insurance policies, clearly
nonsense!

This illustrates an important point. Do not attempt to extend results too far outside the
range of the data. The smallest firm we have recorded has two staff, so we should not try to
draw conclusions for firms much smaller than this.

The next step is to plot our regression line on the same axes as the original data, to check
that we have not made a mistake, and that the relationship seems appropriate.

The easiest way to add the regression line to the plot is to calculate the value of y for several
points on the x-axis, joining the resulting points by a straight line.

x = 5, y = −392.59 + 100.42 × 5 = 109.51.
x = 10, y = −392.59 + 100.42 × 10 = 611.61.
x = 15, y = −392.59 + 100.42 × 15 = 1113.71.
x = 20, y = −392.59 + 100.42 × 20 = 1615.81.

We add these points to our original plot and join them by a line.

[Scatter plot: number of policies (vertical axis) against number of staff (horizontal axis), with the regression line added.]

Figure 2.10: Scatter diagram of number of policies against number of staff for 25 insurance
firms, with the fitted regression line.


How can we make use of this regression line? Suppose we are interested in a firm that has 27
staff, and want to estimate the number of policies it will issue in a month. If x = 27, then we
predict
y = −392.59 + 100.42 × 27 = 2318.75.
We round the answer to the nearest whole number and report that a firm with 27 staff will
issue approximately 2319 insurance policies per month.
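
A prediction of this kind is easy to automate, and it is sensible to flag predictions that fall outside the range of the observed data. The following Python sketch is an illustration only: the function name `predict_policies` and its parameters are hypothetical, the default coefficients are the rounded values of a and b found above, and the limits 2 and 33 are the smallest and largest staff numbers in Table 2.6.

```python
def predict_policies(staff, a=-392.59, b=100.42, x_min=2, x_max=33):
    """Predict monthly policies from staff numbers using y = a + b x.

    Returns the prediction rounded to the nearest whole number, together with
    a flag that is False when `staff` lies outside the observed range
    [x_min, x_max], i.e. when the prediction is an extrapolation.
    """
    y = a + b * staff
    within_range = x_min <= staff <= x_max
    return round(y), within_range

print(predict_policies(27))  # (2319, True)
print(predict_policies(0))   # (-393, False)
```

The second call reproduces the nonsensical value for a firm with no staff, and the flag makes clear that it should not be trusted.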

Example 2.6
Using the data and the diagram you drew in Example 2.4, compute the parameters a and b
in the regression line y = a + bx and add your regression line to the scatter plot.
We will look at this example in the lecture, or you can try it yourself.
