STA 111 - Topic One - Lecture 2
Lecture One covered the Introduction and Nature of Statistical Data. Notes for the
following sections can be accessed on the week one forum on the ODEL platform:
1.1 Objectives
1.2 Introduction
1.3 Data
This is a continuation of Topic One lecture notes.
1.4.1 Objectives
By the end of the lecture the learner should be able to:
i) Summarize a set of data using a table or frequency distribution table.
ii) Display data graphically using bar graphs, histograms, frequency polygons,
frequency curves and ogive curves, and interpret the graphs.
1.4.2 Introduction
Raw data, as first collected, are usually not organized. After the data have been
collected, the next step is to present them in some suitable form. Proper presentation
is needed because statistical data in raw form almost defy comprehension. Frequently
the first stage in presenting data is to produce a table. When the data consist of only a
few figures, they can be easily presented and understood. But when the number of
figures is very large, a proper classification is essential for analysis and for deriving
valid inferences.
Some of the types of graphs that are used to summarize and organize data are the dot
plot, the bar graph, the histogram, the stem-and-leaf plot, the frequency curve, the
frequency polygon, the pie chart, the box plot and the cumulative frequency (ogive)
curve. In this course, we will briefly look at histograms, line graphs and bar graphs, as
well as frequency polygons and the cumulative frequency curve.
Example 1.1
The following set of data consists of exam scores for 25 students
3,3,6,4,5,4,10,5,29,3,5,6,10,31,4,10,3,29,5,31,29,11,31,6,10
Construct an ungrouped frequency distribution table to represent this data set.
Solution
Steps of Construction of Ungrouped frequency distribution table:
1. Identify the smallest and the largest value in the data set and arrange all the values
in the data set in ascending (or descending) order.
2. Tally the number of times each value is appearing in the data.
3. Count the number of tallies of each quantity and record them as the frequency for
the value.
• The smallest value is 3 and the largest is 31
• Arranging the values in ascending order we obtain:
3,3,3,3,4,4,4,5,5,5,5,6,6,6,10,10,10,10,11,29,29,29,31,31,31
• Next step is to construct a frequency distribution table as follows:
(N.B. A tally of five is written as four strokes crossed by a fifth, not as five separate strokes.)
Scores (x)   Tallies   Frequency (f)
3            ////      4
4            ///       3
5            ////      4
6            ///       3
10           ////      4
11           /         1
29           ///       3
31           ///       3
Total                  25
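The tallying steps can also be sketched in Python. This is a minimal illustration, assuming the raw scores of Example 1.1 are typed in as a list; the names `scores` and `freq` are chosen here for illustration and are not part of the lecture.

```python
from collections import Counter

# Raw exam scores from Example 1.1 (25 students).
scores = [3, 3, 6, 4, 5, 4, 10, 5, 29, 3, 5, 6, 10,
          31, 4, 10, 3, 29, 5, 31, 29, 11, 31, 6, 10]

# Steps 2 and 3: count how many times each distinct value appears.
freq = Counter(scores)

# Step 1: list the distinct values in ascending order with their tallies.
print(f"{'Score (x)':<10} {'Tally':<8} {'Frequency (f)'}")
for value in sorted(freq):
    print(f"{value:<10} {'/' * freq[value]:<8} {freq[value]}")
print(f"{'Total':<10} {'':<8} {sum(freq.values())}")
```

The total in the last row should equal the number of observations, which is a quick check that no value was missed while tallying.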
b) “Inclusive” method: Under the “Inclusive” method of classification, the upper
limit of one class is included in that class itself.
3. The number of classes denoted by 𝑘 falls between 5 and 15. (However, there is no
rigidity about it. The classes can be more than 15 depending upon the total number
of observations in the data and the details required). Further, the precise number of
classes to be used for a given variable may depend upon personal judgment and
other considerations such as the details required, the ease of calculation of further
statistical work, etc.
4. The classes should be mutually exclusive.
5. The starting point, i.e., the lower limit of the first class, should either be zero or 5 or
multiples of 5. For example, if the lowest value of the data is 63 and we have taken a
class-interval of 10, then the first class should be 60 – 70, instead of 63 – 73.
6. To ensure continuity and to get correct class-interval we should adopt “exclusive”
method of classification. However, where “inclusive” method has been adopted it is
necessary to make an adjustment to determine the correct class-interval and to have
continuity. See steps in the Construction of a Grouped Frequency Distribution
below. The adjustment consists of finding the difference between the lower limit of
the second class and the upper limit of the first class, dividing the difference by two,
subtracting the value so obtained from all lower limits and adding the value to all
upper limits. This can be expressed in the formula as follows;
$$\text{Correction factor} = \frac{\text{Lower limit of the 2nd class} - \text{Upper limit of the 1st class}}{2}$$
7. Whenever possible all classes should be of the same size.
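The adjustment described in principle 6 can be sketched in a few lines of Python. The three inclusive classes used here (10-19, 20-29, 30-39) are assumed purely for illustration, and the function name `correction_factor` is hypothetical.

```python
def correction_factor(lower_limit_second, upper_limit_first):
    """Half the gap between two consecutive inclusive classes."""
    return (lower_limit_second - upper_limit_first) / 2


# Illustrative inclusive classes (assumed for this sketch only).
classes = [(10, 19), (20, 29), (30, 39)]

factor = correction_factor(classes[1][0], classes[0][1])

# Subtract the factor from every lower limit and add it to every upper limit.
adjusted = [(low - factor, high + factor) for low, high in classes]

print(factor)
print(adjusted)
```

After the adjustment the upper limit of each class coincides with the lower limit of the next, which is what gives continuity.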
Estimate the class interval (𝑖) (sometimes denoted by 𝑐) as
$$i = \frac{R}{k}, \quad \text{rounded up to the nearest } u.$$
Note: You must round up, not round off. For u = 1, rounding 5.2 up gives 6, not 5; the
same rule applies for u = 0.1, where the result is rounded up to the next tenth. If R/k is
exact (no remainder when divided by u), add one to the number of classes. Put simply,
for u = 1, round 𝑖 up to the next whole number so that the classes cover the whole data.
4. The starting value used in the calculation of R above is picked as the lower class limit
(LCL) of the first class. Add the class interval 𝑖 to this LCL successively to get the rest
of the lower class limits.
5. Find the Upper Class Limit (UCL) of the first class by subtracting 𝑢 from the LCL of
the second class. Then continue to add the class interval 𝑖 to this UCL to find the rest
of the upper limits.
6. If necessary, find the class boundaries (CB) for each class as follows.
Lower Class Boundary 𝐿𝐶𝐵 = 𝐿𝐶𝐿 − 0.5𝑢 (0.5𝑢 = the correction factor)
Upper Class Boundary 𝑈𝐶𝐵 = 𝑈𝐶𝐿 + 0.5𝑢
7. Tally the number of observations falling in each class and find the frequencies.
Note: A value x falls into a class LCL − UCL only if LCB ≤ x < UCB. That is x can be
equal to LCB but not UCB of that class.
8. Record the number of tallies in each category as the class frequencies.
9. Compute the cumulative frequencies to confirm that the last value of the column is
equal to the sum of the frequencies.
10. Compute the midpoints of each class using the class boundaries.
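Steps 4 to 10 can be sketched as a short Python function. This is only an illustration under stated assumptions: the function name `grouped_table`, its parameters and the small data set are hypothetical, and the starting value, class interval and number of classes are taken as already chosen.

```python
def grouped_table(data, start, width, k, u=1):
    """Build the rows of a grouped frequency table (steps 4 to 10).

    start: lower class limit (LCL) of the first class
    width: class interval i
    k:     number of classes
    u:     unit of measurement of the data
    """
    rows = []
    cumulative = 0
    for j in range(k):
        lcl = start + j * width        # step 4: lower class limits
        ucl = lcl + width - u          # step 5: upper class limits
        lcb = lcl - 0.5 * u            # step 6: lower class boundary
        ucb = ucl + 0.5 * u            #         upper class boundary
        # steps 7 and 8: x falls in the class only if LCB <= x < UCB
        f = sum(1 for x in data if lcb <= x < ucb)
        cumulative += f                # step 9: cumulative frequency
        midpoint = (lcb + ucb) / 2     # step 10: class midpoint
        rows.append((lcl, ucl, lcb, ucb, f, cumulative, midpoint))
    return rows


# Hypothetical data, used only to exercise the function.
data = [12, 15, 21, 24, 24, 30, 33, 38, 41, 47]
rows = grouped_table(data, start=10, width=10, k=4, u=1)

for lcl, ucl, lcb, ucb, f, cf, mid in rows:
    print(f"{lcl}-{ucl}  {lcb}-{ucb}  f={f}  cf={cf}  midpoint={mid}")
```

The last cumulative frequency should equal the number of observations, which is the check described in step 9.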
Example 1.2
The idea of grouped data can also be illustrated by considering the following raw
dataset:
Time taken (in seconds) by a group of students to answer a simple math question
20 25 24 33 13 16 21 17 11 34
26 8 19 31 11 14 15 21 18 17
The above data can be organized into a frequency distribution (or a grouped data) in
several ways. One method is to use intervals as a basis.
Page 5 of 20
The smallest value in the above data is 8 and the largest is 34. The interval from 8 to 34
is broken up into smaller subintervals (called class intervals). Suppose we take the
number of classes to be
$$k = 1 + 3.322 \log_{10} 20 = 5.322, \quad \text{which we round up to } 6.$$
Then the class width is obtained as:
$$CW = \frac{34 - 8}{6} = 4.33, \quad \text{which rounds up to the next whole number, so } CW \text{ (or } i\text{)} = 5.$$
Class limits (s)   Tallies   Frequency   Cumulative frequency   Class midpoint   Class boundaries (s)
5-9                /         1           1                      7                4.5-9.5
10-14              ////      4           5                      12               9.5-14.5
15-19              ///// /   6           11                     17               14.5-19.5
20-24              ////      4           15                     22               19.5-24.5
25-29              //        2           17                     27               24.5-29.5
30-34              ///       3           20                     32               29.5-34.5
Note that to ensure continuity, the class limits are adjusted to obtain the true class limits
(class boundaries) as shown earlier in the principles of classification number (iv). This is
indicated in the last column.
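The calculation of the number of classes and the class width in Example 1.2 can be checked with a short Python sketch. The variable names are illustrative, and the starting point of 5 for the first class is taken from the table, not computed.

```python
import math

# Times (in seconds) from Example 1.2.
times = [20, 25, 24, 33, 13, 16, 21, 17, 11, 34,
         26, 8, 19, 31, 11, 14, 15, 21, 18, 17]

n = len(times)
k_exact = 1 + 3.322 * math.log10(n)    # number of classes before rounding
k = math.ceil(k_exact)                 # round up to a whole number of classes

data_range = max(times) - min(times)   # R = largest value - smallest value
width = math.ceil(data_range / k)      # round up, not off (u = 1)

# Tally the observations into inclusive classes 5-9, 10-14 and so on.
start = 5
freqs = []
for j in range(k):
    lower = start + j * width
    upper = lower + width - 1
    freqs.append(sum(1 for t in times if lower <= t <= upper))

print(round(k_exact, 3), k, width)
print(freqs)
```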
Example 1.3
Let the marks of 50 students of a class be:
46 58 54 52 55 59 52 62 65 67
64 63 77 78 92 6 7 12 18 16
3 23 25 25 27 81 88 24 29 22
34 33 30 37 36 42 48 28 22 28
17 13 70 37 32 36 41 40 43 44
We can arrange them as follows;
Marks Frequency Marks Frequency
0 – 10 3 50 – 60 6
10 – 20 5 60 – 70 5
20 – 30 10 70 – 80 3
30 – 40 8 80 – 90 2
40 – 50 7 90 – 100 1
Total 50
Data organized and summarized as in the above frequency distribution is called
grouped data.
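The grouping in Example 1.3 can be reproduced with a short Python sketch. It assumes the classes are read in the "exclusive" sense, that is, a mark equal to an upper limit (such as 30 or 40) is counted in the next class; the variable names are illustrative.

```python
# Marks of the 50 students in Example 1.3.
marks = [46, 58, 54, 52, 55, 59, 52, 62, 65, 67,
         64, 63, 77, 78, 92, 6, 7, 12, 18, 16,
         3, 23, 25, 25, 27, 81, 88, 24, 29, 22,
         34, 33, 30, 37, 36, 42, 48, 28, 22, 28,
         17, 13, 70, 37, 32, 36, 41, 40, 43, 44]

width = 10
freq = {}
for m in marks:
    # Floor division finds the lower limit of the class containing m,
    # so a mark of 30 goes to 30 - 40 and not to 20 - 30.
    lower = (m // width) * width
    freq[lower] = freq.get(lower, 0) + 1

for lower in sorted(freq):
    print(f"{lower} - {lower + width}: {freq[lower]}")
print("Total:", sum(freq.values()))
```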
Remark:
Consider the following;
Mass (kg)   Number of students
60-62       5
63-65       18
66-68       42
69-71       27
72-74       8
75-         0
66-68 is referred to as a class interval, where 66 is the lower class limit and 68 is the
upper class limit. 75- is an open class interval.
If the measurements are taken to the nearest kg then, for example, 65.5-68.5 are the true
class limits (class boundaries) of the class 66-68.
The mid-point between the class limits is called the class mark or class midpoint. It is
used in all mathematical analysis of a frequency distribution.
$$\text{Mid-point of a class} = \frac{\text{Upper limit of the class} + \text{Lower limit of the class}}{2}$$
Note: Relative frequencies may also be calculated by dividing the number of cases in
each category by the total number of students (100) and multiplying by 100. For
example, in the class 66-68,
$$\text{relative frequency} = \frac{42}{100} \times 100 = 42\%.$$
Relative frequencies are most useful where the class size is different.
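As a small illustration, the percentage relative frequencies and class midpoints for the mass table can be computed in Python. The open class 75- is left out because it has no upper limit; the variable names are illustrative.

```python
# Class limits and frequencies from the mass table (open class 75- omitted).
class_limits = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]
frequencies = [5, 18, 42, 27, 8]

n = sum(frequencies)   # total number of students

# Relative frequency as a percentage: (f / n) x 100.
relative = [100 * f / n for f in frequencies]

# Class midpoint: (upper limit + lower limit) / 2.
midpoints = [(low + high) / 2 for low, high in class_limits]

for (low, high), f, rf, mid in zip(class_limits, frequencies, relative, midpoints):
    print(f"{low}-{high}: f = {f}, relative frequency = {rf}%, midpoint = {mid}")
```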
The list below shows One-way Commuting Distances (in Km) for 60 workers in Nairobi
city.
13 7 12 6 34 14 47 25 45 2
13 26 10 8 1 14 41 10 3 21
8 13 28 24 16 19 4 7 36 37
20 15 16 15 17 31 17 3 11 46
24 8 40 17 18 12 27 16 4 14
23 9 29 12 2 6 12 18 9 16
A histogram consists of a set of adjoining rectangles whose bases lie on the x-axis,
with centers at the class marks and widths equal to the class interval size. The horizontal axis is
labeled with what the data represents (for instance, distance from campus to your hostel). The
vertical axis is labeled either frequency or relative frequency (or percent frequency or
probability). The graph will have the same shape with either label. The histogram can show you
the shape, the center and the spread of the data.
The relative frequency is equal to the frequency for an observed value of the data divided by the
total number of data values in the sample or population. If:
f = frequency
n = total number of data values (or the sum of the individual frequencies), and
RF = relative frequency,
Then
$$RF = \frac{\text{frequency}}{\text{total frequencies}} = \frac{f}{n}$$
The areas of the rectangles are proportional to the class frequencies. If class intervals
have equal sizes the histogram is obtained by plotting the frequencies against the true
class limits (class boundaries) such that the heights of rectangles are proportional to
class frequencies.
If the class intervals are not equal, plot the frequency density (or relative
frequencies) against the class boundaries as illustrated in Example 1.4 (ii).
$$\text{Frequency density} = \frac{\text{frequency } (f)}{\text{class width } (i)}$$
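The frequency density calculation can be sketched in Python. The class boundaries and frequencies below are hypothetical, chosen only to show classes of unequal width; they are not taken from the lecture.

```python
# Hypothetical classes of unequal width, given by their class boundaries.
boundaries = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 8, 12, 8]

for (lower, upper), f in zip(boundaries, frequencies):
    width = upper - lower      # class width i
    density = f / width        # frequency density = f / i
    area = density * width     # area of the rectangle in the histogram
    print(f"{lower}-{upper}: width = {width}, f = {f}, "
          f"density = {density}, area = {area}")

densities = [f / (upper - lower) for (lower, upper), f in zip(boundaries, frequencies)]
```

Because each rectangle has height f / i and width i, its area equals the class frequency, which is why the histogram stays fair when the widths differ.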
To construct a histogram, first decide how many bars or intervals, also called classes, represent
the data. This is usually equal to the number of intervals (classes) in the data set. Choose the
starting point to be the lower class boundary of a class one below the first interval in the data
set. For instance, if the class intervals were 10-15, 15-20, ….. then the first interval will be 5-10
with a height (frequency) of zero.
Example 1.4
[Figure: histogram of the frequency of students. Vertical axis: Frequency (0 to 20); horizontal axis: Marks, in classes from 0-10 to 90-100.]
ii) Construct a histogram to represent the following data set;
[Figure: Frequency Histogram. Vertical axis: Frequency density (0 to 8); horizontal axis: class boundaries (14.5 to 59.5).]
Exercise: Suppose the classes were of equal widths as given below; construct a histogram (DIY).
Class limits   15-19   20-24   25-29   30-34   35-39   40-44
Frequency      1       4       22      35      20      8
b) FREQUENCY POLYGONS
A frequency polygon is a graphical representation of data. It is used to depict the
shape of the data and to depict trends. It is usually drawn with the help of a histogram
but can be drawn without one as well. If a histogram is already drawn, joining the
midpoints of the tops of adjacent rectangles by straight lines gives the frequency polygon.
Steps to Draw a Frequency Polygon
1. Mark the class intervals for each class on the horizontal axis. The frequency is
plotted on the vertical axis.
2. Calculate the class mark for each class interval. The class mark is the mid-point of
the class: (upper limit + lower limit) / 2.
3. Mark all the class marks on the horizontal axis. The class mark is also known as the
mid-value of the class.
4. Corresponding to each class mark, plot the given frequency. The height always
depicts the frequency. Make sure that the frequency is plotted against the class mark
and not the upper or lower limit of any class.
5. Join the plotted points using line segments. The curve obtained will be kinked.
The resulting curve is called the frequency polygon.
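[Figure: frequency polygon plotted against the class midpoints 5, 15, 25, 35, 45, 55, 65, 75, 85 and 95, with extensions to the neighbouring midpoints.]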
Note:
It is customary to add the extensions PQ and RS to the next lower and next higher
midpoints which have corresponding class frequencies of zero.
ii) Plot the frequency polygon given the following data set
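[Figure: Frequency Polygon. Vertical axis: Frequency (0 to 40); horizontal axis: Class mid-points. Plotted points (mid-point, frequency): (12, 0), (17, 1), (22, 4), (27, 22), (32, 35), (37, 20), (42, 8), (47, 0).]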
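The points of a frequency polygon can be computed with a short Python sketch. It assumes the equal-width classes 15-19, 20-24, 25-29, 30-34, 35-39 and 40-44 with frequencies 1, 4, 22, 35, 20 and 8, and adds one zero-frequency class at each end as described in the note on extensions; the variable names are illustrative.

```python
# Equal-width classes and their frequencies.
class_limits = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
frequencies = [1, 4, 22, 35, 20, 8]

# Class mark (midpoint) of each class: (lower limit + upper limit) / 2.
midpoints = [(low + high) / 2 for low, high in class_limits]

# Distance between consecutive lower limits gives the class width.
width = class_limits[1][0] - class_limits[0][0]

# Extend to the next lower and next higher midpoints with frequency zero.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

points = list(zip(xs, ys))
for x, y in points:
    print(x, y)
```

Joining these points in order with straight line segments gives the polygon, closed down to the horizontal axis at both ends.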
The height of each bar is proportional to the frequency of the variate, but the thickness
of the bar is not significant. A bar chart comprises a number of spaced rectangles, which
generally have their major axes vertical and, being separated, do not suggest continuity.
Bar charts can be used to represent a large variety of statistical data. The bar chart is
appropriate for displaying discrete data with only a few categories.
Example 1.5
a) The following table gives the birth rate per thousand of different countries over a
certain period of time.
Country      Kenya   India   China   Uganda   U.K.   Sweden
Birth rate   30      33      40      29       20     15
Represent the above data by a suitable diagram.
Solution
The appropriate diagram for this data is a simple Bar diagram
[Figure: bar chart titled "Birth rate per thousand of different countries". Vertical axis: Birth rate (0 to 50); horizontal axis: Country (Kenya, India, China, U.K., Uganda, Sweden).]
Comparing the size of the bars, you can easily see that China has the highest birth rate
while Sweden has the lowest.
b) Consider data relating to the number of patients diagnosed with bacterial
meningitis in a hospital each year.
Year              2001   2002   2003   2004   2005
No. of patients   141    225    205    108    192
This data can be represented by the bar chart as shown below.
[Figure: bar chart of the number of patients diagnosed with bacterial meningitis in a hospital during the period 2001 – 2005. Vertical axis: Number of patients (0 to 300); horizontal axis: Year.]
Notice that it is now easy to see the variations in the number of cases over this period
of time.
Multiple bar charts
Bar charts often prove most useful if we have two (or more) sets of comparable data,
and wish to compare and contrast them.
Example 1.6
Suppose that apart from the data relating to the number of patients diagnosed with
bacterial meningitis in a hospital each year, we also have the corresponding numbers
for malaria cases.
Year                              2001   2002   2003   2004   2005
Number of patients (Meningitis)   141    225    205    108    192
Number of patients (Malaria)      321    251    123    547    148
This data can be represented in a component bar chart as shown in the figure below.
[Figure: component bar chart. Vertical axis: patients (0 to 250); horizontal axis: year (2001 to 2005); components: Males, Female.]
Looking at this presentation, it is possible to discern two main features: firstly, we can
see how the meningitis cases vary from year to year, and secondly, we can get a good
idea of the make-up of this total in terms of the proportions of patients who are male or
female.
Pie-Charts
A pie chart presents data in the form of a circle. The slices represent absolute or relative
proportions. A pie chart is formed by marking off a portion of the pie corresponding to
each characteristic being displayed.
Example 1.8
A researcher studying the distribution of manufacturing costs in ABC Ltd found that
20% of the firm’s unit cost is due to labour, 40% raw materials, 25% maintenance costs
and 15% debt servicing. Present this information in a pie chart.
Fig 2: A pie chart representing the distribution of ABC Ltd per unit manufacturing cost
during the year.
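To draw the pie chart by hand, each percentage in Example 1.8 is converted into an angle at the centre of the circle. The sketch below shows that arithmetic in Python; the dictionary name `costs` is illustrative.

```python
# Percentage share of each cost item in Example 1.8.
costs = {
    "Labour": 20,
    "Raw materials": 40,
    "Maintenance": 25,
    "Debt servicing": 15,
}

total = sum(costs.values())   # should be 100 percent

# Angle of each slice: share of the total multiplied by 360 degrees.
angles = {item: share * 360 / total for item, share in costs.items()}

for item, angle in angles.items():
    print(f"{item}: {costs[item]}% -> {angle} degrees")
```

The angles should add up to 360 degrees, which is a useful check before drawing the slices.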
1.4.5 Graphical Presentation
a) Consider for example the sales data for some company over a period of six years as
shown in the table below;
[Figure: graph of the sales data. Vertical axis: 0 to 600,000; horizontal axis: years 2000 to 2004.]
“More Than” Ogive Curve
If we plot the “more than” cumulative frequencies against the corresponding lower
class boundaries and join the points by a smooth curve we get a “more than” ogive
curve.
Example 1.10
Plot the “More than” ogive curve of the marks of students given in example 2 above.
Solution
Marks      Frequency   "More than" cumulative frequency   Lower class boundary
0 – 10     5           100                                0
10 – 20    11          95                                 10
20 – 30    19          84                                 20
30 – 40    21          65                                 30
40 – 50    16          44                                 40
50 – 60    10          28                                 50
60 – 70    8           18                                 60
70 – 80    6           10                                 70
80 – 90    3           4                                  80
90 – 100   1           1                                  90
                       0                                  100
From the graph there are “y” students who scored more than “x” marks.
The value of x at the intersection of the “less than” and “more than” ogive curves is
the median value.
The “less than” ogive is a graph of cumulative frequencies (cf) against upper class
boundaries.
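The cumulative frequencies behind both ogives can be computed with a short Python sketch using the marks data of Example 1.10. The median estimate at the end assumes the plotted points are joined by straight line segments, which is a simplification of the smooth curve; all variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries and frequencies for the marks in Example 1.10.
boundaries = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
frequencies = [5, 11, 19, 21, 16, 10, 8, 6, 3, 1]
total = sum(frequencies)

# "Less than" cumulative frequency at each boundary (0 below the first class).
less_than = [0] + list(accumulate(frequencies))

# "More than" cumulative frequency at each boundary.
more_than = [total - c for c in less_than]

less_than_points = list(zip(boundaries, less_than))
more_than_points = list(zip(boundaries, more_than))

# The two ogives cross where both equal total / 2; interpolate linearly
# along the "less than" ogive to estimate the median.
half = total / 2
for (x0, c0), (x1, c1) in zip(less_than_points, less_than_points[1:]):
    if c0 <= half <= c1:
        median_estimate = x0 + (half - c0) / (c1 - c0) * (x1 - x0)
        break

print(more_than_points)
print(round(median_estimate, 2))
```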
Exercise 1.2
1. Consider the following data:
32, 46, 25, 57, 39, 45, 55, 42, 20, 36,
58, 12, 38, 34, 22, 40, 33, 64, 43, 46,
31, 40, 52, 29, 14, 57, 66, 36, 32, 48,
46, 42, 47, 54, 65, 44, 35, 19, 54, 25,
23, 33, 38, 45, 32, 38, 41, 42, 58, 43.
Arrange the data in a frequency distribution with the first class interval 10 – 19
2. The highway patrol set up a radar checkpoint and recorded the speed in miles per
hour of a random sample of 50 cars that passed the checkpoint in one hour. The
speed of the cars was recorded as follows;
74 66 65 55 48 56 50 75 75 67
76 68 50 65 60 65 60 68 68 76
68 77 63 65 52 52 63 80 80 70
65 81 70 63 45 45 65 71 71 64
55 70 64 45 64 64 40 55 55 71
Make a frequency distribution table using 5 as the class width.
3. Given the data below:
3.0 3.4 4.1 4.1 4.3 2.7 3.5 3.7 3.4 3.4
3.8 4.2 3.1 3.9 3.1 4.1 2.8 3.7 4.4 3.5
3.5 3.4 3.7 3.7 2.8 4.3 3.8 3.4 4.1 3.0
4.4 4.1 4.1 3.6 3.4 2.7 3.6 3.0 3.4 4.3
3.8 3.2 4.2 3.9 4.2 3.4 2.9 4.4 3.5 3.9
Form a frequency distribution using the classes 2.7-2.9, 3.0-3.2, 3.3-3.5,……
4. Using Sturges’ rule, K = 1 + 3.322 log10 N, where K = no. of class-intervals and N = total
no. of observations, classify, in equal intervals, the following hours worked by 20
workers in a factory for one month.
155, 120, 50, 110, 116, 95, 125, 42, 175, 130, 160, 90, 68, 71, 135, 147, 115, 108, 140, 98.
Find the percentage frequency in each class-interval.
5. Represent the following data by a histogram.
Marks Frequency Marks Frequency
0 – 10 5 50 – 60 10
10 – 20 11 60 – 70 8
20 – 30 19 70 – 80 6
30 – 40 21 80 – 90 3
40 – 50 16 90 – 100 1
Total 100
6. Using the data classified in questions 1, 2 and 3, draw:
a) A Histogram
b) A Frequency polygon
c) “less than” and “more than” Ogive curves.
7. A nutritionist is interested in knowing the percent of calories from fat
which Kenyans take in on a daily basis. To study this, the nutritionist randomly
selects 25 Kenyans and evaluates the percent of calories from fat consumed in a
typical day. The results of the study are as follows:
34% 18% 33% 25% 30%
42% 40% 33% 39% 40%
45% 35% 45% 25% 27%
23% 32% 33% 47% 23%
27% 32% 30% 28% 36%
Construct a frequency distribution and the corresponding histogram.
8. In Kenya, approximately 45% of the population has blood type O; 40% type A; 11%
type B; and 4% type AB. Illustrate this distribution of blood types with a pie chart.
9. In the academic years 1982 to 1985, the number of students in College ABC were as
follows;
Year Science Arts Law
1982-83 1000 1500 200
1983-84 1600 2000 350
1984-85 2100 4000 420
Represent the data by an appropriate diagram. (Component bar chart)
10. The table below gives data relating to the Kenyan exports and imports (in millions
of Ksh) during the five years from 1999 to 2004.
Year Export Import
1999-2000 160000 200000
2000-2001 170000 300000
2001-2002 180000 350000
2002-2003 200000 300000
2003-2004 200000 380000
Source: KNBS
Represent this information using a suitable diagram. (multiple bar chart)
11. The following table shows the Kenyan population age structure as per the 2009
census
Age %of total population male female
0-14 40.02 9557274 9497870
15-24 19.15 4552448 4567894
25-54 33.91 8170264 7976751
55-64 3.92 856092 1009075
65 years and above 3 614751 813320
Source CIA World Factbook 2017
How best would you represent this data diagrammatically?
12. The following data represents the maximum temperatures in degrees centigrade
predicted for some 55 major cities on the 24th September 1993.
17 25 21 18 14 15 24 22 15 21 25
17 25 15 18 17 29 16 24 39 30 23
23 27 43 28 29 15 15 19 32 30 32
23 13 18 13 27 32 17 17 25 25 30
20 18 17 33 28 27 26 32 32 33 19
a) Construct a frequency distribution table for these temperatures starting with
the classes: 11-17, 18-24,…….
Solution
Temperature (°C)   Frequency
11 – 17            15
18 – 24            15
25 – 31            16
32 – 38            7
39 – 45            2
b) Represent the data using a histogram, a frequency polygon and an ogive curve
c) Using the appropriate diagrammatic/graph representation of the data, estimate:
i) The modal temperature
ii) The median temperature
iii) The lower and upper class boundaries of the temperature range within which
the middle 50% of all cities lie.
iv) The minimum and maximum temperature of the middle 80% of the cities.
v) On this particular day, a researcher was collecting some data and required
data from cities whose temperatures were above 29.5 °C. How many of these
cities did he include in his study?