Chapter 1 Notes
Chapter 1
Summarizing and Describing Data
In order to visualize the distribution of a set of raw data, we ought to compile the data into a
more comprehensible form, making use of tables and graphs.
A. Frequency Tables
Given a set of raw data we usually arrange it into a frequency distribution where we collect
‘like’ quantities and display them by writing down how many of each type there are to form
a frequency table.
Example 1
In a multiple-choice test with 10 questions, the numbers of correct answers of 40 students are as follows.
10 4 9 6 7 4 8 7 7 5
5 8 7 9 10 6 5 9 7 6
4 7 5 6 7 9 5 8 8 4
8 7 7 5 5 4 8 6 6 6
Solution:
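A minimal Python sketch of how the 40 scores above could be tallied into a frequency table. The names `scores` and `freq` are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Numbers of correct answers of the 40 students in Example 1
scores = [
    10, 4, 9, 6, 7, 4, 8, 7, 7, 5,
    5, 8, 7, 9, 10, 6, 5, 9, 7, 6,
    4, 7, 5, 6, 7, 9, 5, 8, 8, 4,
    8, 7, 7, 5, 5, 4, 8, 6, 6, 6,
]

# Collect 'like' values and count how many of each there are
freq = Counter(scores)

print("Correct answers | Frequency")
for value in sorted(freq):
    print(f"{value:>15} | {freq[value]}")
print(f"{'Total':>15} | {sum(freq.values())}")
```

The total of the frequency column should equal the number of students, which is a quick check that no score was missed.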
B. Bar Chart
Example 2
Using the frequency table constructed in Example 1, draw a bar chart for the distribution of the
number of correct answers of 40 students in the multiple-choice test.
Solution:
[Figure: bar chart of frequency (number of students, vertical axis from 0 to 10) against number of correct answers (horizontal axis, 4 to 10).]
C. Stem-and-leaf diagrams
The stem-and-leaf diagram combines a graphical technique with a sorting technique.
Sorting here means listing the data in rank order according to numerical value.
The data values themselves are used to do this sorting.
The “stem” is the leading digit(s) of a data value, while the “leaf” is the trailing digit.
A stem-and-leaf diagram is a method of presenting a data set so that gaps or concentrations in the
data become visible.
Example 3
Suppose that a class of 40 students obtained the following results in a Mathematics test.
61 80 55 70 76 73 100 90 64 62
75 64 62 66 46 61 67 39 58 63
63 64 51 40 66 43 38 37 28 71
70 49 48 68 86 27 69 74 37 56
Solution:
Stem Leaf
(Tens) (Units)
2 78
3 7789
4 03689
5 1568
6 11223344466789
7 0013456
8 06
9 0
10 0
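The construction of this diagram can be sketched in Python as follows. This is only an illustration for whole-number data with tens as stems and units as leaves; the function name `stem_and_leaf` is an assumption made here.

```python
from collections import defaultdict

# Mathematics test results of the 40 students in Example 3
marks = [
    61, 80, 55, 70, 76, 73, 100, 90, 64, 62,
    75, 64, 62, 66, 46, 61, 67, 39, 58, 63,
    63, 64, 51, 40, 66, 43, 38, 37, 28, 71,
    70, 49, 48, 68, 86, 27, 69, 74, 37, 56,
]

def stem_and_leaf(data):
    """Map each stem (tens) to its sorted list of leaves (units)."""
    rows = defaultdict(list)
    for value in sorted(data):          # sorting first keeps the leaves in order
        stem, leaf = divmod(value, 10)  # e.g. 64 -> stem 6, leaf 4
        rows[stem].append(leaf)
    return dict(rows)

diagram = stem_and_leaf(marks)
# Print every stem from the smallest to the largest, even if it has no leaves,
# so that gaps in the data stay visible.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = "".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```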
Advantages of a stem-and-leaf diagram
2. It is actually partly a table and partly a graph and so it immediately and directly gives a good
picture of the frequency distribution without having to prepare a frequency table first and then
construct charts afterwards.
3. Since the actual data are recorded in the diagram, it retains the information about the original data,
and the information may be recovered readily.
In a frequency table or histogram, data are represented by tallies or areas of rectangles in class intervals
and so some information about the original data is lost and cannot be recovered.
For example, the reading 64 is recorded in its entirety in a stem-and-leaf diagram, but is represented
only by a count of 1 in the class interval (e.g. 60 – 64) in a frequency table or histogram.
4. It can be regarded as the original set of data arranged in ascending order of magnitude.
Hence it can be readily used for finding quartiles.
Disadvantage of a stem-and-leaf diagram
1. For some types of data, the number of stems that can be chosen is either very small or very large,
which makes the diagram inconvenient to construct and unable to show the distribution effectively.
For a large set of data, the purpose of graphical representation is to give a good overall
picture of the distribution rather than to show the details of the data.
A bar chart or a histogram is more suitable in this case.
Example 4
A fishery expert found the following concentrations of mercury, in parts per million, in thirty fish caught
in a certain stream.
0.024 0.031 0.052 0.024 0.024 0.030 0.056 0.034 0.059 0.068
0.035 0.021 0.052 0.023 0.054 0.028 0.037 0.034 0.048 0.040
0.022 0.049 0.043 0.034 0.032 0.021 0.040 0.032 0.021 0.039
Solution:
Stem Leaf
(Unit = 0.01) (Unit = 0.001)
2 11123444
2 8
3 0122444
3 579
4 003
4 89
5 224
5 69
6
6 8
In the above diagram, the units of the stems and leaves have been chosen to make the recorded digits simple.
This is an important feature of a stem-and-leaf diagram.
1.2 Statistical Descriptions
In statistics, there are two useful types of measure which characterize any set of data or
frequency distribution.
The first type, a measure of ‘centralization’, attempts to locate a typical value about which the
distribution clusters. This type of measure is called an average or measure of central tendency
or measure of location.
The second type is a measure of how scattered or spread out a distribution is and is called
a measure of dispersion.
I. Measures of Central Tendency
The most common measures of central tendency or average are the mean, the median and the mode.
A. Mean
Given the complete set of N data $\{x_1, x_2, \ldots, x_N\}$ in a population, the mean $\mu$ is defined as
\[ \mu = \frac{1}{N}(x_1 + x_2 + \cdots + x_N) \quad \text{or} \quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \]
If the set of n data $\{x_{p_1}, x_{p_2}, \ldots, x_{p_n}\}$, where the $p_i$'s are a set of integers selected from 1 to N,
is a sample of size n drawn from a population, then the sample mean is defined similarly,
but is denoted by $\bar{x}$ (read as "x bar"). Thus
\[ \bar{x} = \frac{1}{n}(x_{p_1} + x_{p_2} + \cdots + x_{p_n}) \quad \text{or} \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{p_i} \]
The notation $x_{p_i}$ for the elements of the sample may be a bit difficult for beginners.
Hence, when no misunderstanding arises, we shall denote the sample of size n simply as
\[ \{x_1, x_2, \ldots, x_n\} \]
bearing in mind that the element $x_i$ in the sample is, in general, not the same as the element $x_i$ in
the population.
With this understanding, the sample mean is
\[ \bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) \quad \text{or} \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
Example 4
Suppose that a class of 40 students obtained the following results in a Mathematics test.
61 80 55 70 76 73 100 90 64 62
75 64 62 66 46 61 67 39 58 63
63 64 51 40 66 43 38 37 28 71
70 49 48 68 86 27 69 74 37 56
Solution:
\[ \mu = \frac{1}{40}(61 + 80 + 55 + \cdots + 56) = \frac{2417}{40} = 60.425 \]
Note that a population mean is a unique value, but the sample mean varies from sample to sample.
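As a rough check of the calculation above, the population mean can be computed directly in Python. The sketch also evaluates the mean of one particular sample of ten marks (the sample quoted later in Example 13(c)) to illustrate that a sample mean generally differs from the population mean. The variable names are assumptions made for illustration.

```python
# Population: Mathematics test results of all 40 students
marks = [
    61, 80, 55, 70, 76, 73, 100, 90, 64, 62,
    75, 64, 62, 66, 46, 61, 67, 39, 58, 63,
    63, 64, 51, 40, 66, 43, 38, 37, 28, 71,
    70, 49, 48, 68, 86, 27, 69, 74, 37, 56,
]

# Population mean: sum of all N values divided by N
mu = sum(marks) / len(marks)

# One sample of size 10 drawn from the population
sample = [68, 62, 48, 39, 38, 55, 66, 71, 37, 76]
x_bar = sum(sample) / len(sample)

print(mu)     # 60.425
print(x_bar)  # 56.0
```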
B. Median
The median is a measure of position. It is the middle value in an ordered sequence of data.
To find the median from a set of data collected in its raw form, we must first arrange the data
in rank order, from the smallest to the largest observation. Such an ordered sequence of data is
called an ordered array.
For an ordered array $x_1 \le x_2 \le \cdots \le x_n$ of n data:
(i) if n is odd, the median is $x_{\frac{n+1}{2}}$, the value of the datum that is in the middle;
(ii) if n is even, the median is $\frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right)$, the mean of the two data that are
in the middle.
Example 5
(a) Find the median of the set of data {12, 8, 13, 16, 5}.
(b) Find the median of the set of data {25, 25, 37, 26, 25, 12, 75, 75}.
Solution:
(a) Arrange the set of five data in ascending order: 5, 8, 12, 13, 16. The median is $x_{\frac{5+1}{2}} = x_3 = 12$.
(b) Arrange the set of eight data in ascending order: 12, 25, 25, 25, 26, 37, 75, 75.
The median is
\[ \frac{1}{2}\left(x_{\frac{8}{2}} + x_{\frac{8}{2}+1}\right) = \frac{1}{2}(x_4 + x_5) = \frac{1}{2}(25 + 26) = 25.5 \]
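The two cases of the median rule can be sketched in Python as below. The function name `median` is chosen here for illustration. Note that Python lists are indexed from 0, so position k in the ordered array corresponds to index k - 1.

```python
def median(data):
    """Median of raw data: sort first, then take the middle value(s)."""
    ordered = sorted(data)   # the ordered array
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: position (n + 1) / 2, which is index mid when counting from 0
        return ordered[mid]
    # n even: mean of positions n / 2 and n / 2 + 1 (indices mid - 1 and mid)
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([12, 8, 13, 16, 5]))                 # 12
print(median([25, 25, 37, 26, 25, 12, 75, 75]))   # 25.5
```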
C. Mode
The mode of a set of data is the value that occurs with the highest frequency.
In this sense it is “most typical” of a set of data.
A distribution with one mode is called a unimodal distribution, while those with two modes are
bimodal, and those with three or more are multimodal.
The two main advantages of the mode are that it requires no calculations, only counting, and that
it can be determined for qualitative as well as quantitative data.
However, if all the values in the set of data are different, the mode is of no use.
Example 6
Suppose that 50 children are asked which of the six brands of soft drink they prefer most
and the following results are obtained.
Brand                 A    B    C    D    E    F
Number of children    4   15    5    8    3   15
Solution:
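Assuming the example asks for the mode of the preferences, a short Python sketch of the counting is given below. The dictionary `preferences` simply restates the table; the names used are illustrative.

```python
# Number of children preferring each brand (from the table above)
preferences = {"A": 4, "B": 15, "C": 5, "D": 8, "E": 3, "F": 15}

# The mode is the category (or categories) with the highest frequency
highest = max(preferences.values())
modes = [brand for brand, count in preferences.items() if count == highest]

print(highest)  # 15
print(modes)    # ['B', 'F']
```

Two brands share the highest frequency, so this qualitative data set has two modes, an example of a bimodal distribution.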
II. Measures of Dispersion
The measures of central tendency provide only brief information on a set of data.
The averages alone cannot tell us how spread out or dispersed the data are.
We need a measure of dispersion: a numerical value indicating the amount of scatter about
a central point.
Widely dispersed data are also highly variable data. Hence measures of dispersion are also called
measures of variability.
The most common measures of dispersion in statistics are the range, the inter-quartile range,
the variance and the standard deviation.
A. Range
The range of a set of data is the difference between the largest value and the smallest value of the set.
In general, the greater the range, the greater the dispersion of the set of data.
Example 7
Solution:
Since the range of the scores of athlete A is greater than that of athlete B, we say that the scores of
athlete A are more dispersed than those of athlete B.
B. Inter-quartile range
With the set of data arranged in ascending order, the median is the value which divides the set of
data into two equal parts.
Similarly, if we divide the set of data into four equal parts, the corresponding values, denoted by
$Q_1$, $Q_2$, $Q_3$, are called the first, second and third quartiles respectively,
and $Q_2$ is just the median of the distribution.
If the set of data is divided into 100 equal parts, the dividing values are called percentiles and
are denoted by $P_1, P_2, \ldots, P_{99}$.
The 50th percentile, $P_{50}$, corresponds to the median,
whereas $P_{25}$ and $P_{75}$ correspond to $Q_1$ and $Q_3$ respectively.
The pth percentile of a data set is a value such that at least p percent of the items take on this value or less
and at least (100 – p) percent of the items take on this value or more.
$Q_1$ is the first quartile (or lower quartile), below which 25% of the data lie;
$Q_2$ is the second quartile (or middle quartile or median), below which 50% of the data lie; and
$Q_3$ is the third quartile (or upper quartile), below which 75% of the data lie.
To find the pth percentile, first arrange the set of discrete data $x_1, x_2, \ldots, x_n$ in ascending order,
then compute the index i, where
\[ i = \frac{p}{100} \times n \]
to find the position of the pth percentile.
If i is not an integer, round it up to the next integer; the pth percentile is the value in that position.
If i is an integer, the pth percentile is the average of the values in positions i and i + 1.
Example 8
(a) Find the inter-quartile range of the data set A {14, 23, 16, 18, 15, 44, 19}.
(b) Find the inter-quartile range of the data set B {10, 15, 40, 28, 34, 18, 24, 30}.
(c) By comparing the inter-quartile range of the data sets A and B, which set has a greater dispersion?
Solution:
(a) Arrange the seven data of the data set A in ascending order: 14, 15, 16, 18, 19, 23, 44.
For the 25th percentile, the index $i = \frac{25}{100} \times 7 = 1.75$, which is rounded up to 2,
hence $Q_1 = x_2 = 15$.
For the 75th percentile, the index $i = \frac{75}{100} \times 7 = 5.25$, which is rounded up to 6,
hence $Q_3 = x_6 = 23$.
The inter-quartile range $= Q_3 - Q_1 = 23 - 15 = 8$.
(b) Arrange the eight data of the data set B in ascending order: 10, 15, 18, 24, 28, 30, 34, 40.
For the 25th percentile, the index $i = \frac{25}{100} \times 8 = 2$,
hence $Q_1 = \frac{1}{2}(x_2 + x_3) = \frac{1}{2}(15 + 18) = 16.5$.
For the 75th percentile, the index $i = \frac{75}{100} \times 8 = 6$,
hence $Q_3 = \frac{1}{2}(x_6 + x_7) = \frac{1}{2}(30 + 34) = 32$.
The inter-quartile range $= Q_3 - Q_1 = 32 - 16.5 = 15.5$.
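The index rule used in parts (a) and (b) can be sketched in Python as follows. The function names `percentile` and `iqr`, and the restriction to whole-number p strictly between 0 and 100, are assumptions made for this illustration. Other textbooks and software packages use different percentile conventions and may give slightly different quartiles.

```python
def percentile(data, p):
    """pth percentile (integer 0 < p < 100) by the index rule i = p/100 * n."""
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    ordered = sorted(data)
    # Integer arithmetic avoids rounding problems: p * n = 100 * q + r
    q, r = divmod(p * len(ordered), 100)
    if r:
        # i is not an integer: round up to position q + 1, i.e. index q
        return ordered[q]
    # i is an integer: average the values in positions i and i + 1
    return (ordered[q - 1] + ordered[q]) / 2

def iqr(data):
    """Inter-quartile range Q3 - Q1."""
    return percentile(data, 75) - percentile(data, 25)

set_a = [14, 23, 16, 18, 15, 44, 19]
set_b = [10, 15, 40, 28, 34, 18, 24, 30]

print(percentile(set_a, 25), percentile(set_a, 75), iqr(set_a))  # 15 23 8
print(percentile(set_b, 25), percentile(set_b, 75), iqr(set_b))  # 16.5 32.0 15.5
```

Comparing the two printed inter-quartile ranges answers part (c): the set with the larger value has the greater dispersion by this measure.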
The range considers the difference between the maximum and minimum values of a set of data.
The inter-quartile range considers the range of 50% of the data in the middle and thus avoids the
impact of extreme values.
Therefore if there are extreme values in a set of data, the inter-quartile range is a better measure of
dispersion than the range.
Moreover, the inter-quartile range exists even if the set of data has open ends.
Box-and-Whisker Diagram
The median, the lower quartile and the upper quartile together with the maximum and the minimum
values provide a good description of a set of data as they indicate some of the most important
characteristics of the set. These five key descriptive statistical measures are often called the
five-number summary of the set of data. A graphical display of these measures, called a
box-and-whisker diagram or a box plot, gives an even better visual impression of the set.
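[Figure: box-and-whisker diagram marking Minimum, Q1, Q2 (median), Q3 and Maximum. The lower 25% of the data lie between the minimum and Q1, the middle 50% between Q1 and Q3 (the IQR), and the upper 25% between Q3 and the maximum; the range spans the minimum to the maximum.]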
A box-and-whisker diagram consists of a rectangular box drawn with its length parallel to the x-axis
and with its ends marking the position of the lower and the upper quartiles. An orange bar is then
inserted in the box to mark the median. The two extreme values, the minimum and the maximum
values of the data, are linked to the box by lines, called whiskers, parallel to the x-axis.
A glance at the diagram then gives us good information about the central tendency, dispersion and
extreme values of the set.
(1) The bar at the median shows the location of the centre of the data.
(2) The length of the box equals the inter-quartile range and shows the dispersion of the
middle 50% of the data.
(3) The lengths of the whiskers show the dispersion of the data below the lower quartile and
above the upper quartile, describing the behavior at the ends or tails of the distribution.
(4) The shape of the diagram gives a quick impression of the degree of symmetry of the data
distribution about the median.
It is easy to use box-and-whisker diagrams to compare the features, such as location of centre,
dispersion and symmetry of different sets of data. However, a box-and-whisker diagram does not
reveal the total frequency of each set of data, nor the frequency of the data for any specific range.
If such information is required, a stem-and-leaf diagram, bar chart or histogram can be used.
Box-and-whisker diagrams are particularly useful for comparing the central tendency and
the dispersion of two or more sets of data.
Example 9
The following box-and-whisker diagrams show the distributions of marks in the Chinese, English and
Mathematics tests.
(a) Which test has the marks with the largest inter-quartile range?
(b) Which test has the marks with the smallest range?
(c) Which test has the highest median mark?
(d) If Mary gets 70 marks in all three tests, in which test does she perform the best?
Briefly explain your answer.
Solution:
(a) Since the box for the Mathematics test is the longest, the Mathematics test has the marks with
the largest inter-quartile range.
(b) Since the distance between the two ends of the whiskers is the shortest for the Chinese test,
the Chinese test has the marks with the smallest range.
(c) Since the orange bar in the box for the Mathematics test is at the rightmost position, the median mark
of the Mathematics test is the highest.
(d) From the box-and-whisker diagrams above, Mary’s mark in the English test is in the top
25% of the class, while her marks in the Mathematics and Chinese tests are not.
Hence Mary performs the best in the English test.
Skewness of Distributions
A distribution is symmetric if the parts above and below its center are mirror images.
If $Q_2 - Q_1 = Q_3 - Q_2$, the distribution is symmetric.
[Figure: box-and-whisker diagram of a symmetric distribution, marking Min, Q1, Q2, Q3 and Max.]
A distribution is skewed to the right if the right side is longer, while it is skewed to the left if the left
side is longer.
A positively skewed or right-skewed distribution is an asymmetric distribution with a “tail” on the right,
which indicates the presence of extreme values at the positive end of the distribution.
A distribution is positively skewed if $Q_2 - Q_1 < Q_3 - Q_2$.
[Figure: box-and-whisker diagram of a positively skewed distribution, marking Min, Q1, Q2, Q3 and Max.]
A negatively skewed or left-skewed distribution is an asymmetric distribution with a “tail” on the left,
which indicates the presence of extreme values at the negative end of the distribution.
A distribution is negatively skewed if $Q_2 - Q_1 > Q_3 - Q_2$.
[Figure: box-and-whisker diagram of a negatively skewed distribution, marking Min, Q1, Q2, Q3 and Max.]
Example 10
Consider the stem-and-leaf diagram constructed in Example 3 for the distribution of results of the class
of 40 students in the Mathematics test.
Stem Leaf
(Tens) (Units)
2 78
3 7789
4 03689
5 1568
6 11223344466789
7 0013456
8 06
9 0
10 0
(a) Find the median, the first and the third quartiles.
(b) Construct the box-and-whisker diagram.
(c) Use the quartiles to comment on the skewness of the distribution.
Solution:
(a) The median is
\[ \frac{1}{2}\left(x_{\frac{40}{2}} + x_{\frac{40}{2}+1}\right) = \frac{1}{2}(x_{20} + x_{21}) = \frac{1}{2}(63 + 63) = 63 \]
For the 25th percentile, the index $i = \frac{25}{100} \times 40 = 10$, hence $Q_1 = \frac{1}{2}(x_{10} + x_{11}) = \frac{1}{2}(48 + 49) = 48.5$.
For the 75th percentile, the index $i = \frac{75}{100} \times 40 = 30$, hence $Q_3 = \frac{1}{2}(x_{30} + x_{31}) = \frac{1}{2}(70 + 70) = 70$.
(b) [Figure: box-and-whisker diagram on an axis from 25 to 100, with minimum 27, $Q_1$ = 48.5, median 63, $Q_3$ = 70 and maximum 100.]
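The five-number summary behind the box-and-whisker diagram, and the quartile comparison needed for part (c), can be sketched in Python as follows. The helper `percentile` applies the index rule $i = \frac{p}{100} \times n$ from these notes; its name, and the names `five_number_summary` and `skewness`, are assumptions made for illustration.

```python
def percentile(data, p):
    """pth percentile (integer 0 < p < 100) by the index rule i = p/100 * n."""
    ordered = sorted(data)
    q, r = divmod(p * len(ordered), 100)
    if r:
        # i is not an integer: round up to position q + 1, i.e. index q
        return ordered[q]
    # i is an integer: average the values in positions i and i + 1
    return (ordered[q - 1] + ordered[q]) / 2

# Mathematics test results of the 40 students
marks = [
    61, 80, 55, 70, 76, 73, 100, 90, 64, 62,
    75, 64, 62, 66, 46, 61, 67, 39, 58, 63,
    63, 64, 51, 40, 66, 43, 38, 37, 28, 71,
    70, 49, 48, 68, 86, 27, 69, 74, 37, 56,
]

q1, q2, q3 = (percentile(marks, p) for p in (25, 50, 75))
five_number_summary = (min(marks), q1, q2, q3, max(marks))

# Compare the two halves of the box, as in the skewness rule
lower, upper = q2 - q1, q3 - q2
if lower < upper:
    skewness = "positively skewed"
elif lower > upper:
    skewness = "negatively skewed"
else:
    skewness = "symmetric"

print(five_number_summary)  # (27, 48.5, 63.0, 70.0, 100)
print(lower, upper, skewness)
```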
Example 11
The table below gives the monthly salaries in dollars of 25 employees of a certain department.
Solution:
(a)
Stem Leaf
(Unit = $1000) (Unit = $100)
6 126
7 348
8 279
9 25679
10 245
11 9
12 7
13 3
14 2
15 5
16 6
17 9
18
19
20 2
(b) The mean is
\[
\begin{aligned}
\frac{1}{25}(&7800 + 11900 + 12700 + 10400 + 20200 + 6200 + 7300 + 9200 + 15500 + 17900 \\
&+ 9700 + 9500 + 10500 + 13300 + 10200 + 9900 + 14200 + 8900 + 8700 + 16600 \\
&+ 7400 + 6600 + 9600 + 6100 + 8200) = 10740
\end{aligned}
\]
(c) Making use of the stem-and-leaf diagram for the distribution of the salaries (with a column of
cumulative frequencies added to help locate the quartiles):
For the 25th percentile, the index $i = \frac{25}{100} \times 25 = 6.25$, which is rounded up to 7,
hence $Q_1 = x_7 = 8200$.
For the 75th percentile, the index $i = \frac{75}{100} \times 25 = 18.75$, which is rounded up to 19,
hence $Q_3 = x_{19} = 12700$.
(d) [Figure: box-and-whisker diagram of the salaries on an axis from 6000 to 21000, with the median marked at 9700.]
C. Variance and Standard Deviation
Although the inter-quartile range is an improved measure of dispersion compared with the range,
it still does not make use of the actual values of all the data in the set and therefore cannot completely
reflect the dispersion of the data. Measures of dispersion which do take into account
all the values are the variance and the standard deviation.
To overcome the limitations of the range and the inter-quartile range mentioned above, we can find the
distance of each datum from the centre of a group of data. The greater the average distance of all
data from the centre, the wider the dispersion of the set of data.
If the set of N data $\{x_1, x_2, \ldots, x_N\}$ represents a population with mean $\mu$, then the variance of the
set of data is defined as the mean of the squares of the deviations of individual values from the
population mean, and is commonly denoted by $\sigma^2$. Thus, the population variance is
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\left[(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2\right] \]
Large variances indicate large dispersion and small variances indicate small dispersion.
However, the variance defined above does not have the same unit as the original values of x.
To have a measure of dispersion with the same unit as the original data, we take the positive square
root of the variance. The resulting measure is called the standard deviation of the set of data. Thus,
the population standard deviation is
\[ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} = \sqrt{\frac{1}{N}\left[(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2\right]} \]
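If the set of n data $\{x_1, x_2, \ldots, x_n\}$ is a sample of size n drawn from a population and with mean $\bar{x}$,
then the sample variance, $s^2$, is defined as
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right] \]
The sample standard deviation, s, is the positive square root of the sample variance.
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\right]} \]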
Note that the differences between the sample variance $s^2$ and the population variance $\sigma^2$ are that
the sample mean $\bar{x}$ is used instead of the population mean $\mu$, and the divisor is n – 1 instead of N.
The standard deviation gives us an idea of how close the data are to their mean, and thus
we can learn about the consistency of the set of data.
The smaller the standard deviation, the less dispersed the set of data is.
In other words, the distribution of data in the set is more consistent.
Example 12
The temperatures (in °C) of water in seven beakers are: 30, 32, 33, 28, 31, 29, 34.
(a) Find the mean of the temperatures of the water.
(b) Find the population standard deviation of the temperatures of the water.
Solution:
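One way to check a hand calculation for this example is to code the definitions directly. The sketch below follows the formulas for $\sigma^2$ and $s^2$ given earlier; the function names are assumptions made for illustration, and the sample variance is included only to show the effect of the divisor n – 1.

```python
import math

temps = [30, 32, 33, 28, 31, 29, 34]  # water temperatures in degrees Celsius

def mean(data):
    return sum(data) / len(data)

def population_variance(data):
    """Mean of the squared deviations from the mean (divisor N)."""
    mu = mean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = mean(data)
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

mu = mean(temps)
sigma = math.sqrt(population_variance(temps))

print(mu)                      # 31.0
print(sigma)                   # 2.0
print(sample_variance(temps))  # larger than the population variance
```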
Example 13
(a) Find the variance and standard deviation of the population of Mathematics test marks in Example 3,
with the population mean 60.425.
(b) If the passing mark is one population standard deviation less than the mean, find the number of
students who failed the Mathematics test.
(c) The sample $S_2$ = {68, 62, 48, 39, 38, 55, 66, 71, 37, 76} has been drawn from the population of
Mathematics test marks in Example 3. The sample mean was found to be 56.
Find the sample variance and sample standard deviation.
Solution:
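A sketch of how the three parts could be checked with Python's standard `statistics` module, where `pvariance` and `pstdev` use the divisor N and `variance` and `stdev` use the divisor n – 1. It assumes that a student fails when the mark is strictly below the passing mark; the variable names are illustrative.

```python
import statistics

# Population: Mathematics test marks of the 40 students
marks = [
    61, 80, 55, 70, 76, 73, 100, 90, 64, 62,
    75, 64, 62, 66, 46, 61, 67, 39, 58, 63,
    63, 64, 51, 40, 66, 43, 38, 37, 28, 71,
    70, 49, 48, 68, 86, 27, 69, 74, 37, 56,
]

# (a) Population variance and standard deviation (divisor N)
mu = statistics.mean(marks)
pop_var = statistics.pvariance(marks)
sigma = statistics.pstdev(marks)

# (b) Passing mark is one population standard deviation below the mean.
# Assumption: "failed" means a mark strictly below the passing mark.
passing_mark = mu - sigma
failed = sum(1 for m in marks if m < passing_mark)

# (c) Sample variance and standard deviation (divisor n - 1)
sample = [68, 62, 48, 39, 38, 55, 66, 71, 37, 76]
sample_var = statistics.variance(sample)
s = statistics.stdev(sample)

print(pop_var, sigma)
print(passing_mark, failed)
print(sample_var, s)
```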
Use Scientific Calculator to find mean and standard deviation
Use the calculator to find the mean and standard deviation of the data set
{1, 2, 5, 6, 8, 9, 10, 12, 14, 18}
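The calculator keystrokes depend on the model and are not reproduced here. As an independent cross-check of whatever the calculator displays, the same quantities can be computed with Python's standard `statistics` module. Both the population (divisor N) and the sample (divisor n – 1) standard deviations are shown, since calculators commonly offer both; the variable names are illustrative.

```python
import statistics

data = [1, 2, 5, 6, 8, 9, 10, 12, 14, 18]

mean = statistics.mean(data)        # arithmetic mean
pop_sd = statistics.pstdev(data)    # population standard deviation (divisor N)
sample_sd = statistics.stdev(data)  # sample standard deviation (divisor n - 1)

print(mean)
print(round(pop_sd, 4))
print(round(sample_sd, 4))
```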