1 Statistical modelling
Statistical modelling
Example: When a die is rolled, we say that the probability of each number is 1/6. This is a
statistical model, but the assumption that each face is equally likely might not be true.
Suppose the die is weighted to increase the chance of a six. We might then find, after
experimenting, that the probability of a six is some value p greater than 1/6 and the
probability of a one is 1/3 – p, with the probability of each of the other faces remaining
at 1/6. In this case we have refined, or improved, the model to give a truer picture.
Example: The heights of a large group of adults are measured. The mean is 172⋅3 cm and the
standard deviation is 12⋅4 cm.
It is thought that the general shape of the histogram can be modelled by the curve
$f(x) = \frac{1}{12.4\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - 172.3}{12.4}\right)^2}$
This might not give a true picture, in which case we would have to change the equation, or
refine the model.
Definition
A statistical model is a simplification of a real world situation. It can be used to make
predictions about a real world problem. By analysing and refining the model an improved
understanding may be obtained.
Advantages
• the model is quick and easy to produce
• the model helps our understanding of the real world problem
• the model helps us to make predictions
• the model helps us to control a situation – e.g. railway timetables, air traffic control etc.
Disadvantages
• the model simplifies the situation and only describes a part of the real world problem.
• the model may only work in certain situations, or for a particular range of values.
2 Representation of sample data
Variables
Qualitative variables
Non-numerical - e.g. red, blue or long, short etc.
Quantitative variables
Numerical - e.g. length, age, time, number of coins in pocket, etc
Continuous variables
Can take any value within a given range - e.g. height, time, age etc.
Discrete variables
Can only take certain values - e.g. shoe size, cost in £ and p, number of coins.
Frequency distributions
Frequency tables
A list of discrete values and their frequencies.
Example: The number of M&Ms is counted in several bags, and recorded in the frequency
table below:
number of M&Ms   37  38  39  40  41  42  43
frequency         3   8  11  19  13   7   2
Cumulative frequency
Add up the frequencies as you go along the list.
number of M&Ms         37  38  39  40  41  42  43
frequency               3   8  11  19  13   7   2
cumulative frequency    3  11  22  41  54  61  63
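The running total can be sketched in a few lines of Python. This is only an illustration using the M&Ms table; the variable names are our own choice.

```python
from itertools import accumulate

# Frequency table from the M&Ms example.
counts = [37, 38, 39, 40, 41, 42, 43]
frequencies = [3, 8, 11, 19, 13, 7, 2]

# Each cumulative frequency is the sum of all frequencies up to that value.
cumulative = list(accumulate(frequencies))

for value, f, cf in zip(counts, frequencies, cumulative):
    print(value, f, cf)
```

The last cumulative frequency is always the total number of bags.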
Stem and leaf & back-to-back stem and leaf diagrams
Line up the digits on the leaves so that it looks like a bar chart.
Add a key; e.g. 5|2 means 52, or 4|3 means 4⋅3 etc.
Comparing two distributions from a back to back stem and leaf diagram.
              A |   | B                              Key: 8|1 means 81
                | 8 | 1 2 2 4 5 6 7 8
          3 1 0 | 7 | 2 3 5 6 7 7 8 8 9 9 9
      3 2 2 1 1 | 6 | 1 3 4 5 5 6 6 7 8 8 8 9 9
    7 6 5 5 3 3 | 5 | 0 0 1 2 3 4 5 6 8
  9 8 7 4 3 0 0 | 4 |
    8 7 5 3 1 1 | 3 |
      9 9 7 4 2 | 2 |
      5 4 3 2 1 | 1 |
            8 7 | 0 |
1. The values in A are on average smaller than those in B.
2. The values in A are more spread out than those in B.
Grouped frequency distributions
Class boundaries and widths
When deciding class boundaries you must not leave a gap between one class and
another, whether dealing with continuous or discrete distributions.
For discrete distributions avoid leaving gaps between classes by using class boundaries as
shown below:
X takes the values 0, 1, 2, 3, 4, 5, 6, 7, …
Class interval    Class contains     Class boundaries
(as given)                           (without gaps)
0 – 4             0, 1, 2, 3, 4      0 – 4½
5 – 9             5, 6, 7, 8, 9      4½ – 9½
10 – 12           10, 11, 12         9½ – 12½
etc.
For continuous distributions the class boundaries can be anywhere.
Cumulative frequency curves for grouped data
class       class           frequency   class     cumulative
interval    boundaries                            frequency
0 – 4       0 to 4½         27          ≤ 4½      27
5 – 9       4½ to 9½        36          ≤ 9½      63
10 – 19     9½ to 19½       54          ≤ 19½     117
20 – 29     19½ to 29½      49          ≤ 29½     166
30 – 59     29½ to 59½      24          ≤ 59½     190
60 – 99     59½ to 99½      10          ≤ 99½     200
Plot points at the ends of the intervals, (4½, 27), (9½, 63), (19½, 117) etc., and join the
points with a smooth curve.
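As a rough sketch, the points to plot can be generated from the class intervals. The half-unit shift assumes the data are whole numbers, as in the table; the names are illustrative only.

```python
from itertools import accumulate

# Discrete class intervals (both ends included) and their frequencies.
intervals = [(0, 4), (5, 9), (10, 19), (20, 29), (30, 59), (60, 99)]
frequencies = [27, 36, 54, 49, 24, 10]

# The upper class boundary sits half-way between the end of one class
# and the start of the next, e.g. 4.5 between 4 and 5.
upper_boundaries = [upper + 0.5 for _, upper in intervals]
cumulative = list(accumulate(frequencies))

# Points (upper boundary, cumulative frequency) for the curve.
points = list(zip(upper_boundaries, cumulative))
print(points)
```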
Histograms
Plot the axes with a continuous scale as normal graphs.
There are no gaps between the bars of a histogram.
Area equals frequency.
Note that the total area under a frequency histogram is N, the total number of items,
and the area from a to b is the number of items between a and b.
To draw a histogram, first draw up a table showing the class intervals, class boundaries, class
widths, frequencies and then heights = frequency ÷ class width, as shown below:
class       class           class     frequency   height
interval    boundaries      width
0 – 4       0 to 4½         4½        27          6
5 – 9       4½ to 9½        5         36          7⋅2
10 – 19     9½ to 19½       10        54          5⋅4
20 – 29     19½ to 29½      10        49          4⋅9
30 – 59     29½ to 59½      30        24          0⋅8
60 – 99     59½ to 99½      40        10          0⋅25
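The heights can be checked with a short calculation. This is a sketch only, taking the class boundaries at the halves (4½, 9½, …) and using names of our own.

```python
# Class boundaries (no gaps) and the frequency of each class.
boundaries = [0, 4.5, 9.5, 19.5, 29.5, 59.5, 99.5]
frequencies = [27, 36, 54, 49, 24, 10]

# Width of each class = upper boundary - lower boundary.
widths = [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]

# Height = frequency / class width, so that area = frequency.
heights = [f / w for f, w in zip(frequencies, widths)]

# The total area (height x width summed over the bars) should equal N.
total_area = sum(h * w for h, w in zip(heights, widths))
print(widths, heights, total_area)
```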
Example: A grouped frequency table for the weights of adults has the following entries:
weight (kg)   …   …   50 – 60   …   70 – 85   …
frequency     …   …   60        …   20        …
In a histogram, the bar for the class 50 – 60 kg is 2 cm wide and 9 cm high.
Find the width and height of the bar for the 70 – 85 kg class.
Solution: 50 – 60 is usually taken to mean 50 ≤ weight < 60
The width of the 50 – 60 class is 10 kg ≡ 2 cm
⇒ width of the 70 – 85 class is 15 kg ≡ 15/10 × 2 = 3 cm
The area of the 50 – 60 bar is 2 × 9 = 18 cm² ≡ frequency 60
⇒ the frequency of the 70 – 85 bar is 20 ≡ an area of 20/60 × 18 = 6 cm²
⇒ the height of the 70 – 85 bar is area ÷ width = 6 ÷ 3 = 2 cm.
Answer: the width of the 70 – 85 kg bar is 3 cm, and its height is 2 cm.
3 Mode, mean (and median)
Mode
The mode is the value, or class interval, which occurs most often.
Mean
The mean of the values x1, x2, … , xn with frequencies f1, f2, … , fn is
$\bar{x} = \frac{1}{N}\sum x_i f_i$, where $N = \sum f_i$
Example: Find the mean for the following table showing the number of children per family.
Solution:
Number of children   Frequency
x                    f           xf
0                    5           0
1                    8           8
2                    12          24
3                    18          54
4                    9           36
5                    4           20
Total                56          142
Σ xi fi = 142, and N = Σ fi = 56
⇒ x̄ = 142/56 = 2⋅54 to 3 S.F.
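The same calculation can be sketched in Python, summing each value times its frequency and dividing by the total frequency. The names are illustrative.

```python
# Number of children per family and the frequency of each value.
children = [0, 1, 2, 3, 4, 5]
frequencies = [5, 8, 12, 18, 9, 4]

n = sum(frequencies)                                       # N = sum of f
total = sum(x * f for x, f in zip(children, frequencies))  # sum of x*f
mean = total / n

print(n, total, round(mean, 2))
```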
In a grouped frequency table you must use the mid-interval value.
Example: The table shows the numbers of children in prep school classes in a town.
Estimate the mean number of children per class.
Solution:
Number        Mid-interval   Frequency
of children   value
              x              f           xf
1 – 10        5⋅5            5           27⋅5
11 – 15       13             8           104
16 – 20       18             12          216
21 – 30       25⋅5           18          459
31 – 40       35⋅5           11          390⋅5
Total                        54          1197
Σ xi fi = 1197, and N = Σ fi = 54
⇒ x̄ = 1197/54 = 22⋅2 to 3 S.F.
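For grouped data the only extra step is to replace each class by its mid-interval value. A minimal sketch, assuming the mid-interval value is the average of the two ends of the class as in the table:

```python
# Class intervals (both ends included) and their frequencies.
classes = [(1, 10), (11, 15), (16, 20), (21, 30), (31, 40)]
frequencies = [5, 8, 12, 18, 11]

# Mid-interval value of each class.
midpoints = [(low + high) / 2 for low, high in classes]

n = sum(frequencies)
total = sum(x * f for x, f in zip(midpoints, frequencies))
mean = total / n   # an estimate, since the individual values are unknown

print(midpoints, total, round(mean, 1))
```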
Coding
The weights of a group of people are given as x1, x2, … xn in kilograms. These weights are
now changed to grammes and given as t1, t2, … tn .
In this case ti = 1000 × xi – this is an example of coding.
Another example of coding could be ti = (xi – 20)/5.
Coding and calculating the mean
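With the coding ti = (xi – 20)/5, we are subtracting 20 from each x-value and then dividing the
result by 5.
We first find the mean for ti, and then we reverse the process to find the mean for xi
⇒ we find the mean for ti, multiply by 5 and add 20, giving $\bar{x} = 5\bar{t} + 20$
Proof: $t_i = \frac{x_i - 20}{5} \Rightarrow x_i = 5t_i + 20$
$\bar{x} = \frac{1}{N}\sum x_i f_i = \frac{1}{N}\sum (5t_i + 20) f_i$
$= 5 \times \frac{1}{N}\sum t_i f_i + 20 \times \frac{1}{N}\sum f_i$
$= 5\bar{t} + 20$, since $\bar{t} = \frac{1}{N}\sum t_i f_i$ and $N = \sum f_i$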
Example: Use the coding ti = (xi – 165)/10 to find the mean weight for the following
distribution.
Weight, kg   Mid-interval   Coded value           Frequency
             xi             ti = (xi – 165)/10    fi          ti fi
140 – 150    145            –2                    9           –18
150 – 160    155            –1                    21          –21
160 – 170    165            0                     37          0
170 – 180    175            1                     28          28
180 – 190    185            2                     11          22
Total                                             106         11
⇒ t̄ = 11/106
and ti = (xi – 165)/10 ⇒ x̄ = 10t̄ + 165 = 10 × 11/106 + 165 = 166⋅0377358
⇒ mean weight is 166⋅04 kg to 2 D.P.
Here the coding simplified the arithmetic for those who like to work without a calculator!
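A quick check that decoding gives the same answer as working directly with the mid-interval values might look like this. It is a sketch with our own names, using the coding ti = (xi – 165)/10.

```python
# Mid-interval weights (kg) and frequencies.
midpoints = [145, 155, 165, 175, 185]
frequencies = [9, 21, 37, 28, 11]
n = sum(frequencies)

# Code each value: subtract 165, then divide by 10.
coded = [(x - 165) / 10 for x in midpoints]

# Mean of the coded values, then reverse the coding.
t_mean = sum(t * f for t, f in zip(coded, frequencies)) / n
x_mean = 10 * t_mean + 165

# The same mean found directly, without coding.
direct_mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n

print(round(x_mean, 2), round(direct_mean, 2))
```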
Median
The median is the middle number in an ordered list. Finding the median is explained in the
next section.
When to use mode, median and mean
Mode
You should use the mode if the data is qualitative (colour etc.) or if quantitative (numbers)
with a clearly defined mode (or bi-modal). It is not much use if the distribution is fairly even.
Median
You should use this for quantitative data (numbers), particularly when there are extreme values
(outliers).
Mean
This is for quantitative data (numbers), and uses all pieces of data. It therefore reflects the
whole data set, but is affected by extreme values (outliers).
4 Median (Q2), quartiles (Q1, Q3) and percentiles
Discrete lists and discrete frequency tables
To find medians and quartiles
1. Find k = n/2 (for Q2), n/4 (for Q1), 3n/4 (for Q3), where n is the number of values.
2. If k is an integer, use the mean of the kth and (k + 1)th numbers in the list.
3. If k is not an integer, use the next integer up, and find the number with that position in
the list.
Interquartile range
The interquartile range, I.Q.R., is Q3 – Q1.
Discrete lists
A discrete list of 10 numbers is shown below:
x   11  13  17  25  33  34  42  49  51  52
n = 10   for Q1, n/4 = 2½ so use 3rd number, ⇒ Q1 = 17
         for Q2, n/2 = 5 so use mean of 5th and 6th, ⇒ Q2 = 33½ (median)
         for Q3, 3n/4 = 7½ so use 8th number, ⇒ Q3 = 49
The interquartile range, I.Q.R., is Q3 – Q1 = 49 – 17 = 32.
Discrete frequency tables
x          5   6   7   8   9   10  11  12
f          3   6   8   10  9   8   6   4
cum freq   3   9   17  27  36  44  50  54
n = 54   for Q1, n/4 = 13½ so use 14th number, ⇒ Q1 = 7
         for Q2, n/2 = 27 so use mean of 27th and 28th, ⇒ Q2 = 8½ (median)
         for Q3, 3n/4 = 40½ so use 41st number, ⇒ Q3 = 10
The interquartile range, I.Q.R., is Q3 – Q1 = 10 – 7 = 3.
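The rule for discrete data (take k = n/4, n/2 or 3n/4; average the kth and (k + 1)th values if k is an integer, otherwise round up) can be sketched as below. The function names are our own, and other textbooks and software use slightly different quartile conventions.

```python
import math

def value_at(ordered, k):
    """Value at position k (counting from 1) using the rule for discrete data."""
    if float(k).is_integer():
        k = int(k)
        # k is an integer: mean of the kth and (k + 1)th values.
        return (ordered[k - 1] + ordered[k]) / 2
    # k is not an integer: use the next position up.
    return ordered[math.ceil(k) - 1]

def quartiles(values):
    """Return (Q1, Q2, Q3); intended for lists of at least four values."""
    ordered = sorted(values)
    n = len(ordered)
    return (value_at(ordered, n / 4),
            value_at(ordered, n / 2),
            value_at(ordered, 3 * n / 4))

# The discrete list of 10 numbers.
numbers = [11, 13, 17, 25, 33, 34, 42, 49, 51, 52]
list_quartiles = quartiles(numbers)

# The frequency table, written out as a full list of 54 values.
table_values = [5, 6, 7, 8, 9, 10, 11, 12]
table_freqs = [3, 6, 8, 10, 9, 8, 6, 4]
expanded = [x for x, f in zip(table_values, table_freqs) for _ in range(f)]
table_quartiles = quartiles(expanded)

print(list_quartiles, table_quartiles)
```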
Grouped frequency tables, continuous and discrete data
To find medians and quartiles
1. Find k = n/2 (for Q2), n/4 (for Q1), 3n/4 (for Q3).
2. Do not round k up or change it in any way.
3. Use linear interpolation to find median and quartiles – note that you must use the
correct intervals for discrete data (start at the ½s).
Grouped frequency tables, continuous data
class boundaries   frequency   cumulative frequency
0 to 5             27          27
5 to 10            36          63
10 to 20           54          117
20 to 30           49          166
30 to 60           24          190
60 to 100          12          202
With continuous data, the end of one interval is the same as the start of the next – no gaps.
To find Q1, n = 202 ⇒ n/4 = 50½ (do not change it)
class boundaries          5      Q1      10
cumulative frequencies    27     50⋅5    63
Q1 is 50⋅5 – 27 = 23⋅5 items into a class of 36 items and width 5, so (Q1 – 5)/5 = 23⋅5/36
⇒ Q1 = 5 + 5 × 23⋅5/36 = 8⋅263888889 = 8⋅26 to 3 S.F.
To find Q2, n = 202 ⇒ n/2 = 101 (do not change it)
class boundaries          10     Q2      20
cumulative frequencies    63     101     117
⇒ (Q2 – 10)/10 = (101 – 63)/(117 – 63) ⇒ Q2 = 10 + 10 × 38/54 = 17⋅037… = 17⋅0 to 3 S.F.
Similarly for Q3, 3n/4 = 151⋅5, so Q3 lies in the interval (20, 30)
⇒ (Q3 – 20)/(30 – 20) = (151⋅5 – 117)/(166 – 117) ⇒ Q3 = 20 + 10 × 34⋅5/49 = 27⋅0408… = 27⋅0 to 3 S.F.
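The interpolation step can be sketched as a small function: find the class containing the kth item, then move the corresponding fraction of the way across that class. The function name and layout are our own.

```python
def interpolate(boundaries, frequencies, k):
    """Estimate the value below which k items lie, by linear interpolation."""
    below = 0  # cumulative frequency before the current class
    for lower, upper, f in zip(boundaries, boundaries[1:], frequencies):
        if f > 0 and k <= below + f:
            # Fraction of the way through this class = (k - below) / f.
            return lower + (upper - lower) * (k - below) / f
        below += f
    raise ValueError("k is larger than the total frequency")

# Continuous data: class boundaries and frequencies from the table.
boundaries = [0, 5, 10, 20, 30, 60, 100]
frequencies = [27, 36, 54, 49, 24, 12]
n = sum(frequencies)

q1 = interpolate(boundaries, frequencies, n / 4)      # k = 50.5, not rounded
q2 = interpolate(boundaries, frequencies, n / 2)      # k = 101
q3 = interpolate(boundaries, frequencies, 3 * n / 4)  # k = 151.5

print(q1, q2, q3)
```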
Grouped frequency tables, discrete data
The discrete data in grouped frequency tables is treated as continuous.
1. Change the class boundaries to the ½s: 4½, 9½ etc.
2. Proceed as for grouped frequency tables for continuous data.
class       class           frequency   cumulative
interval    boundaries                  frequency
0 – 4       0 to 4½         25          25
5 – 9       4½ to 9½        32          57
10 – 19     9½ to 19½       51          108
20 – 29     19½ to 29½      47          155
30 – 59     29½ to 59½      20          175
60 – 99     59½ to 99½      8           183
To find Q1, n = 183 ⇒ n/4 = 45⋅75
class boundaries          4⋅5    Q1       9⋅5
cumulative frequencies    25     45⋅75    57
From the diagram, (Q1 – 4⋅5)/5 = (45⋅75 – 25)/(57 – 25)
⇒ Q1 = 4⋅5 + 5 × 20⋅75/32 = 7⋅7421875… = 7⋅74 to 3 S.F.
Q2 and Q3 can be found in a similar way.
Percentiles
Percentiles are calculated in exactly the same way as quartiles.
Example: For the 90th percentile, find 90n/100 and proceed as above.
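As an illustration, a percentile for grouped data can be found with the same interpolation, taking k = (percent × n)/100. The sketch below uses the grouped discrete table above with its ½ boundaries; the function name is our own.

```python
def grouped_percentile(boundaries, frequencies, percent):
    """Estimate a percentile of grouped data by linear interpolation."""
    n = sum(frequencies)
    k = percent * n / 100  # position of the item; do not round it
    below = 0              # cumulative frequency before the current class
    for lower, upper, f in zip(boundaries, boundaries[1:], frequencies):
        if f > 0 and k <= below + f:
            return lower + (upper - lower) * (k - below) / f
        below += f
    raise ValueError("percent must be between 0 and 100")

# Grouped discrete data, using class boundaries at the halves.
boundaries = [0, 4.5, 9.5, 19.5, 29.5, 59.5, 99.5]
frequencies = [25, 32, 51, 47, 20, 8]

q1 = grouped_percentile(boundaries, frequencies, 25)   # lower quartile
p90 = grouped_percentile(boundaries, frequencies, 90)  # 90th percentile

print(q1, p90)
```

Here the quartiles are simply the 25th, 50th and 75th percentiles.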
Box Plots
In a group of people the youngest is 21 and the oldest is 52. The quartiles are 32 and 45,
and the median age is 41.
We can illustrate this information with a box plot as below – remember to include a scale.
[Box plot on an age scale from 20 to 50: lowest value 21, Q1 = 32, Q2 = 41, Q3 = 45, highest value 52.]
Outliers
An outlier is an extreme value. You are not required to remember how to find an outlier – you
will always be given a rule.
Example: The ages of 11 children are given below.
age 3 6 12 12 13 14 14 15 17 21 26
Q1 = 12, Q2 = 14 and Q3 = 17.
Outliers are values outside the range Q1 – 1⋅5 × (Q3 – Q1) to Q3 + 1⋅5 × (Q3 – Q1).
Find any outliers, and draw a box plot.
Solution: Lower boundary for outliers is 12 – 1⋅5 × (17 – 12) = 4⋅5
Upper boundary for outliers is 17 + 1⋅5 × (17 – 12) = 24⋅5
⇒ 3 and 26 are the only outliers.
To draw a box plot, put crosses at 3 and 26, and draw the lines to 6 (the lowest value
which is not an outlier), and to 21 (the highest value which is not an outlier).
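[Box plot on an age scale marked at 5, 10, 15, 20, 25: crosses (×) at the outliers 3 and 26;
whiskers to 6 (lowest value which is not an outlier) and 21 (highest value which is not an
outlier); box at Q1 = 12, Q2 = 14, Q3 = 17.]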
Note that there are other ways of drawing box plots with outliers, but this is the safest and will
never be wrong – so why not use it.
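The rule used in this example can be sketched as follows. Remember that the rule itself will be given in the question, so the 1⋅5 multiplier here is specific to this example; the names are our own.

```python
ages = [3, 6, 12, 12, 13, 14, 14, 15, 17, 21, 26]
q1, q3 = 12, 17  # quartiles as given in the example

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Outliers lie outside the limits; everything else is inside.
outliers = [a for a in ages if a < lower_limit or a > upper_limit]
inside = [a for a in ages if lower_limit <= a <= upper_limit]

# The whiskers of the box plot end at the extreme values which are not outliers.
whisker_ends = (min(inside), max(inside))

print(lower_limit, upper_limit, outliers, whisker_ends)
```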
Skewness
A distribution which is symmetrical is not skewed
Positive skew
If a symmetrical box plot is stretched in the direction of the positive x-axis, then the resulting
distribution has positive skew.
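[Diagram: a symmetrical box plot with Q1, Q2, Q3 marked, and the same box plot pulled to the
right so that the gap from Q2 to Q3 is wider than the gap from Q1 to Q2.]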
For positive skew the diagram shows that Q3 – Q2 > Q2 – Q1
The same ideas apply for a continuous distribution, and a little bit of thought should show that
for positive skew mean > median > mode.
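[Diagram: a curve with its tail pulled to the right; the mode, median and mean are marked in
that order from left to right.]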
Negative skew
If a symmetrical box plot is stretched in the direction of the negative x-axis, then the resulting
distribution has negative skew.
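[Diagram: a symmetrical box plot with Q1, Q2, Q3 marked, and the same box plot pulled to the
left so that the gap from Q1 to Q2 is wider than the gap from Q2 to Q3.]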
For negative skew the diagram shows that Q3 – Q2 < Q2 – Q1
The same ideas apply for a continuous distribution, and a little bit of thought should show that
for negative skew mean < median < mode.
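[Diagram: a curve with its tail pulled to the left; the mean, median and mode are marked in
that order from left to right.]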
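The quartile comparison can be sketched as a small function. This is only a rough indicator based on Q3 – Q2 and Q2 – Q1; treating equal gaps as "symmetrical" is our own simplification, and the function name is illustrative.

```python
def quartile_skew(q1, q2, q3):
    """Describe the skew suggested by the quartiles."""
    upper_gap = q3 - q2
    lower_gap = q2 - q1
    if upper_gap > lower_gap:
        return "positive"
    if upper_gap < lower_gap:
        return "negative"
    return "symmetrical"

# Quartiles 32, 41, 45 (the ages in the box plot example).
print(quartile_skew(32, 41, 45))
# Quartiles 12, 14, 17 (the ages in the outlier example).
print(quartile_skew(12, 14, 17))
```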
5 Measures of spread
Range & interquartile range
Range
The range is found by subtracting the smallest value from the largest value.
Interquartile range
The interquartile range is found by subtracting the lower quartile from the upper quartile,
so I.Q.R. = Q3 – Q1.
Variance and standard deviation
Variance is the square of the standard deviation.
$s^2 = \frac{1}{N}\sum (x_i - \bar{x})^2 f_i$, or
$s^2 = \frac{1}{N}\sum x_i^2 f_i - \bar{x}^2$.
When finding the variance, it is nearly always easier to use the second formula.
Variance and standard deviation measure the spread of the distribution.
Proof of the alternative formula for variance
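$s^2 = \frac{1}{N}\sum (x_i - \bar{x})^2 f_i = \frac{1}{N}\sum (x_i^2 - 2x_i\bar{x} + \bar{x}^2) f_i$
$= \frac{1}{N}\sum x_i^2 f_i - 2\bar{x} \times \frac{1}{N}\sum x_i f_i + \bar{x}^2 \times \frac{1}{N}\sum f_i$
$= \frac{1}{N}\sum x_i^2 f_i - 2\bar{x}^2 + \bar{x}^2$
since $\bar{x} = \frac{1}{N}\sum x_i f_i$ and $N = \sum f_i$
$\Rightarrow s^2 = \frac{1}{N}\sum (x_i - \bar{x})^2 f_i = \frac{1}{N}\sum x_i^2 f_i - \bar{x}^2$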
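A numerical check that the two forms agree (the mean of the squared deviations, and the mean of the squares minus the square of the mean) might look like this. It is a sketch using the children-per-family table from the section on the mean; the names are our own.

```python
import math

# Number of children per family and frequencies.
values = [0, 1, 2, 3, 4, 5]
frequencies = [5, 8, 12, 18, 9, 4]

n = sum(frequencies)
mean = sum(x * f for x, f in zip(values, frequencies)) / n

# First formula: mean of the squared deviations from the mean.
var_definition = sum((x - mean) ** 2 * f for x, f in zip(values, frequencies)) / n

# Second formula: mean of the squares minus the square of the mean.
sum_x2f = sum(x * x * f for x, f in zip(values, frequencies))
var_alternative = sum_x2f / n - mean ** 2

standard_deviation = math.sqrt(var_alternative)
print(var_definition, var_alternative, standard_deviation)
```

The second form needs only the totals Σ fi, Σ xi fi and Σ xi² fi, which is why it is usually quicker by hand.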
Rough checks, m ± s, m ± 2s
When calculating a standard deviation, you should check that there is approximately
65 - 70% of the population within 1 s.d. of the mean and
approximately 95% within 2 s.d. of the mean.
These approximations are best for a fairly symmetrical distribution.
Coding and variance
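Using the coding ti = (xi – k)/a, with a > 0, we see that
xi = ati + k ⇒ $\bar{x} = a\bar{t} + k$
$s_x^2 = \frac{1}{N}\sum (x_i - \bar{x})^2 f_i = \frac{1}{N}\sum (at_i - a\bar{t})^2 f_i = a^2 s_t^2 \Rightarrow s_x = a s_t$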
Notice that subtracting k has no effect, since this is equivalent to translating the graph, and
therefore does not change the spread, and if all the x-values are divided by a, then we need to
multiply st by a to find sx.
Example: Find the mean and standard deviation for the following distribution.
Here the x-values are nasty, but if we change them to form ti = (xi – 210)/5 then the
arithmetic in the last two columns becomes much easier.
x     ti = (xi – 210)/5   fi    ti fi   ti² fi
200   –2                  12    –24     48
205   –1                  23    –23     23
210   0                   42    0       0
215   1                   30    30      30
220   2                   10    20      40
Total                     117   3       141
the mean of t is t̄ = (1/N) Σ ti fi = 3/117 = 1/39
and the variance of t is st² = (1/N) Σ ti² fi – t̄² = 141/117 – (1/39)² = 1⋅204470743
⇒ st = √1⋅204470743 = 1⋅097483824 = 1⋅10 to 3 S.F.
To find x̄, using ti = (xi – 210)/5, ⇒ x̄ = 5t̄ + 210 = 210⋅1 to 1 D.P.
To find the standard deviation of x
sx = 5st = 5 × 1⋅0974838… = 5⋅49 to 3 S.F.
To find the variance of x we would need to multiply the variance of t by 5² = 25
⇒ sx² = 25st² = 25 × 1⋅204470743 = 30⋅1 to 3 S.F.
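The coded calculation can be checked against a direct calculation on the x-values. This is a sketch only, using the coding ti = (xi – 210)/5 and names of our own.

```python
import math

xs = [200, 205, 210, 215, 220]
fs = [12, 23, 42, 30, 10]
n = sum(fs)

# Code the values: subtract 210, then divide by 5.
ts = [(x - 210) / 5 for x in xs]

# Mean and variance of the coded values.
t_mean = sum(t * f for t, f in zip(ts, fs)) / n
t_var = sum(t * t * f for t, f in zip(ts, fs)) / n - t_mean ** 2
t_sd = math.sqrt(t_var)

# Reverse the coding: the mean is multiplied by 5 and shifted by 210,
# the standard deviation is only multiplied by 5.
x_mean = 5 * t_mean + 210
x_sd = 5 * t_sd

# Direct calculation on the x-values, for comparison.
direct_mean = sum(x * f for x, f in zip(xs, fs)) / n
direct_var = sum(x * x * f for x, f in zip(xs, fs)) / n - direct_mean ** 2

print(x_mean, x_sd, direct_mean, math.sqrt(direct_var))
```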