FAKULTI SAINS & TEKNOLOGI
SMS1012 - DATA ANALYTICS
CHAPTER 2
1 PRACTICAL
1.1 Example: Out-of-pocket prescription medicine expenses
The data set lists the out-of-pocket prescription medicine expenses (in dollars) for 30 U.S. adults in a recent
year.
(Adapted from: Health, United States, 2015)
200 239 155 252 384 165 296 405 303 400
307 241 256 315 330 317 352 266 276 345
238 306 290 271 345 312 293 195 168 342
(a) Construct a frequency distribution that has seven classes.
For this example, the frequency distribution was already calculated in the previous lecture. It is
given as follows:
Class     Frequency, f   Midpoint   Relative frequency   Cumulative frequency
155-190   3              172.5      0.10                 3
191-226   2              208.5      0.07                 5
227-262   5              244.5      0.17                 10
263-298   6              280.5      0.20                 16
299-334   7              316.5      0.23                 23
335-370   4              352.5      0.13                 27
371-406   3              388.5      0.10                 30
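The table above can also be reproduced directly in R. The following is a minimal sketch, not the method used in the lecture: it assumes the class width is the range divided by the number of classes, rounded up, and that the first lower boundary lies half a unit below the minimum. The names n_class, width and freq_table are illustrative only.

```r
# Out-of-pocket expenses (same data as in the example)
med_exp <- c(200,239,155,252,384,165,296,405,303,400,
             307,241,256,315,330,317,352,266,276,345,
             238,306,290,271,345,312,293,195,168,342)

n_class <- 7
# Class width: range divided by the number of classes, rounded up
width <- ceiling((max(med_exp) - min(med_exp)) / n_class)

# Class boundaries, starting half a unit below the smallest value
c_bound <- seq(min(med_exp) - 0.5, by = width, length.out = n_class + 1)

# Count the observations that fall between consecutive boundaries
freq <- as.vector(table(cut(med_exp, breaks = c_bound)))

# Assemble the frequency distribution
freq_table <- data.frame(
  lower    = c_bound[-length(c_bound)] + 0.5,               # lower class limits
  upper    = c_bound[-1] - 0.5,                             # upper class limits
  f        = freq,                                          # class frequencies
  midpoint = (c_bound[-length(c_bound)] + c_bound[-1]) / 2, # class midpoints
  rel_freq = round(freq / sum(freq), 2),                    # relative frequencies
  cum_freq = cumsum(freq)                                   # cumulative frequencies
)
print(freq_table)
```

The frequencies produced this way should agree with the vector freq that is typed in by hand in part (b).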
(b) Display the data using a frequency histogram and a frequency polygon in R.
We can plot the frequency histogram as follows:
#Create vector for data
med_exp=c(200,239,155,252,384,165,296,405,303,400,
307,241,256,315,330,317,352,266,276,345,
238,306,290,271,345,312,293,195,168,342)
#create class boundaries
c_bound=seq(154.5,406.5,by=36)
#Create vector for frequency
freq=c(3,2,5,6,7,4,3)
#Create frequency histogram (labeled with class boundaries)
hist(med_exp,breaks=c_bound,xaxp=c(154.5,406.5,7),ylim=c(0,8),
main="Out-of-Pocket Prescription Medicine Expenses",xlab="Expense (in dollars)",
ylab="Frequency (number of adults)")
To print the frequency above each bar, we can use:
mid_point=c(172.5,208.5,244.5,280.5,316.5,352.5,388.5)
text(mid_point,freq,labels=freq,adj=c(0.2, -1.0))
[Figure: frequency histogram "Out-of-Pocket Prescription Medicine Expenses"; x-axis: Expense (in dollars), labeled with the class boundaries 154.5 to 406.5; y-axis: Frequency (number of adults), 0 to 8; each bar labeled with its frequency]
We can also plot the frequency histogram with the horizontal axis labeled by the class midpoints.
#Create frequency histogram (labeled with class midpoints)
hist(med_exp,breaks=c_bound,xaxp=c(172.5,388.5,6),ylim=c(0,8),
main="Out-of-Pocket Prescription Medicine Expenses",xlab="Expenses (in dollars)",
ylab="Frequency (number of adults)")
text(mid_point,freq,labels=freq,adj=c(0.5, -0.5))
[Figure: frequency histogram "Out-of-Pocket Prescription Medicine Expenses"; x-axis: Expenses (in dollars), labeled with the class midpoints 172.5 to 388.5; y-axis: Frequency (number of adults), 0 to 8; each bar labeled with its frequency]
To construct the frequency polygon, we can use the same horizontal and vertical scales as the frequency
histogram with class midpoint labels. However, the beginning and end of the graph should be on the
horizontal axis. Therefore, we should extend the left side by one class width before the first class
midpoint and the right side by one class width after the final class midpoint.
#create midpoints vector
mid_point2=c(136.5,172.5,208.5,244.5,280.5,316.5,352.5,388.5,424.5)
After that, we must extend the frequency vector to nine values by adding a zero frequency before the
first class and after the last class.
#Create frequency vector for frequency polygon
freq2=c(0,freq,0)
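The extended midpoints do not have to be typed by hand. As a small sketch, assuming all classes share the same width, they can be derived from the original midpoint vector; the name width is illustrative only.

```r
# Class midpoints and frequencies from the frequency distribution
mid_point <- c(172.5,208.5,244.5,280.5,316.5,352.5,388.5)
freq <- c(3,2,5,6,7,4,3)

# Class width, taken as the distance between consecutive midpoints
width <- diff(mid_point)[1]

# Add one midpoint before the first class and one after the last class
mid_point2 <- c(mid_point[1] - width, mid_point, mid_point[length(mid_point)] + width)

# Matching frequencies: zero at both ends so the polygon meets the axis
freq2 <- c(0, freq, 0)

print(data.frame(mid_point2, freq2))
```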
Then, we can plot the frequency polygon as follows:
#Plot frequency polygon (8 intervals on the horizontal axis put a tick at each of the 9 midpoints)
plot(mid_point2, freq2, type = "b", pch = 20, col = "blue", lwd = 1,xaxp=c(136.5,424.5,8),
ylim = c(0,8),main="Out-of-Pocket Prescription Medicine Expenses",xlab="Expenses (in dollars)",
ylab="Frequency (number of adults)")
#Label each class midpoint with its frequency
text(mid_point,freq,labels=freq, adj=c(0.5, -0.5))
[Figure: frequency polygon "Out-of-Pocket Prescription Medicine Expenses"; x-axis: Expenses (in dollars), 136.5 to 424.5; y-axis: Frequency (number of adults), 0 to 8; each point labeled with its class frequency]
(c) Display the data using a relative frequency histogram in R.
First, we need to calculate the relative frequencies and replace the counts stored in the histogram
object with them.
h=hist(med_exp,breaks=c_bound)
#Calculate relative frequency
h$counts=h$counts/sum(h$counts)
Then, we can plot the relative frequency histogram as follows:
#Plot relative frequency histogram
plot(h,xaxp=c(154.5,406.5,7),ylim=c(0,0.25),col="grey",
main="Out-of-Pocket Prescription Medicine Expenses",xlab="Expenses (in dollars)",
ylab="Relative frequency (portion of adults)")
#Label each bar with its relative frequency
text(mid_point,h$counts,labels=round(h$counts,2), adj=c(0.5, -0.5))
[Figure: relative frequency histogram "Out-of-Pocket Prescription Medicine Expenses"; x-axis: Expenses (in dollars), labeled with the class boundaries 154.5 to 406.5; y-axis: Relative frequency (portion of adults), 0.00 to 0.25; bars labeled 0.1, 0.07, 0.17, 0.2, 0.23, 0.13, 0.1]
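One reason for overwriting the counts by hand is that hist() with freq=FALSE draws densities rather than relative frequencies. The sketch below is for checking only, and uses plot=FALSE so that no graph is drawn. It shows that a density equals the relative frequency divided by the class width, so the two differ whenever the class width is not 1.

```r
# Same data and class boundaries as in the example
med_exp <- c(200,239,155,252,384,165,296,405,303,400,
             307,241,256,315,330,317,352,266,276,345,
             238,306,290,271,345,312,293,195,168,342)
c_bound <- seq(154.5, 406.5, by = 36)

# Compute the histogram object without drawing it
h <- hist(med_exp, breaks = c_bound, plot = FALSE)

# Relative frequency of each class
rel_freq <- h$counts / sum(h$counts)

# Density multiplied by class width recovers the relative frequency
print(all.equal(h$density * diff(h$breaks), rel_freq))

# Relative frequencies always add up to 1
print(sum(rel_freq))
```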
(d) Display the data using an ogive using R.
First, we need to calculate the cumulative frequencies:
#Create cumulative frequency vector (it starts at 0 at the lower boundary of the first class)
cum_freq = c(0, cumsum(freq))
#Create vector of class boundaries (lower boundary of the first class, then every upper boundary)
upper_class=c(154.5,190.5,226.5,262.5,298.5,334.5,370.5,406.5)
Then, we can plot the ogive as follows:
#Create cumulative frequency graph (ogive); xaxt="n" suppresses the default horizontal axis
plot(upper_class,cum_freq, type = "b", pch = 20, col = "blue", lwd = 1,xaxt = "n",
main="Out-of-Pocket Prescription Medicine Expenses",xlab="Expenses (in dollars)",
ylab="Cumulative frequency (number of adults)")
#Draw the horizontal axis with ticks at the class boundaries
axis(1,at=upper_class)
[Figure: ogive "Out-of-Pocket Prescription Medicine Expenses"; x-axis: Expenses (in dollars), labeled with the class boundaries 154.5 to 406.5; y-axis: cumulative frequency (number of adults), 0 to 30]
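An ogive can also be read numerically. As a rough sketch, the base R function approx() interpolates linearly between the plotted points, which is the same as reading a value off the line segments of the graph. The cut-off of 300 dollars and the names cum_rel and est are chosen here only for illustration, and the result is an estimate, not an exact count.

```r
# Class frequencies and class boundaries from the example
freq <- c(3,2,5,6,7,4,3)
upper_class <- seq(154.5, 406.5, by = 36)
cum_freq <- c(0, cumsum(freq))

# Cumulative relative frequency (proportion of adults at or below each boundary)
cum_rel <- cum_freq / sum(freq)

# Estimated number of adults who spent at most 300 dollars,
# read off the ogive by linear interpolation
est <- approx(upper_class, cum_freq, xout = 300)$y

print(round(cum_rel, 2))
print(est)
```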
1.2 Example: Text messages sent
The data set lists the numbers of text messages sent in one day by 50 cell phone users.
(Adapted from Pew Research)
(a) Display the data in a stem-and-leaf plot using R.
We can create a stem-and-leaf plot as follows:
text_msg=c(76,122,66,76,41,26,33,29,23,38,
49,76,80,115,86,29,24,32,33,34,
102,89,78,99,48,33,43,29,30,53,
58,67,69,72,52,26,16,29,41,30,
88,80,56,19,28,20,39,40,33,149)
#create stem and leaf plot
stem(text_msg,scale=2)
Here is the stem-and-leaf plot:
The decimal point is 1 digit(s) to the right of the |
1 | 69
2 | 0346689999
3 | 0023333489
4 | 011389
5 | 2368
6 | 679
7 | 26668
8 | 00689
9 | 9
10 | 2
11 | 5
12 | 2
13 |
14 | 9
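The number of leaves on each stem can be checked without reading the plot. The following is a small sketch that uses integer division by 10 to recover the stem of each value; the name stem_count is illustrative only.

```r
# Text messages sent in one day (same data as in the example)
text_msg <- c(76,122,66,76,41,26,33,29,23,38,
              49,76,80,115,86,29,24,32,33,34,
              102,89,78,99,48,33,43,29,30,53,
              58,67,69,72,52,26,16,29,41,30,
              88,80,56,19,28,20,39,40,33,149)

# The stem of a value is its tens part (for example, 76 has stem 7)
stem_count <- table(text_msg %/% 10)
print(stem_count)

# The leaves of a single stem, in increasing order (here: stem 3)
print(sort(text_msg[text_msg %/% 10 == 3]) %% 10)
```

Stem 13 does not appear in this table because no value lies between 130 and 139, which matches the empty row in the plot.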
(b) Use a dot plot to organize the data set using R.
We can create a dot plot as follows:
#Create dot plot (the horizontal axis shows the data values, not frequencies)
stripchart(text_msg, method = "stack", offset = 1, at = .15, pch = 19,xaxp=c(0,150,30),
main = "Number of Text Messages Sent", xlab = "Number of text messages")
[Figure: dot plot "Number of Text Messages Sent"; horizontal axis: number of text messages, 15 to 150; repeated values shown as stacked dots]
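The height of each stack in the dot plot is simply the number of times a value occurs. As a brief sketch, table() gives these counts directly; the names dots and tallest are illustrative only.

```r
# Text messages sent in one day (same data as in the example)
text_msg <- c(76,122,66,76,41,26,33,29,23,38,
              49,76,80,115,86,29,24,32,33,34,
              102,89,78,99,48,33,43,29,30,53,
              58,67,69,72,52,26,16,29,41,30,
              88,80,56,19,28,20,39,40,33,149)

# Number of dots stacked above each distinct value
dots <- table(text_msg)

# Values with the tallest stack
tallest <- names(dots)[dots == max(dots)]
print(tallest)
print(max(dots))
```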
1.3 Example: Earned degree conferred
The numbers of earned degrees conferred (in thousands) in 2014 are shown in the table. Use a pie chart to
organize the data.
(Source: U.S. National Center for Educational Statistics)
To create a pie chart, we need to calculate the relative frequency (percent) of each category.
#Create vector for data
freq_degree <- c(1003, 1870, 754, 178)
degree <- c("Associate's", "Bachelor's", "Master's", "Doctoral")
#Calculate percentage
pct <-round(100 * freq_degree / sum(freq_degree), 1)
Then, we can create the pie chart as follows:
#Create pie chart
pie(freq_degree,labels= paste(degree, sep = " ", pct, "%"), col = c("#FF8805","blue","green",
"yellow"),main = "Earned Degrees Conferred in 2014")
[Figure: pie chart "Earned Degrees Conferred in 2014"; slices: Associate's 26.4 %, Bachelor's 49.1 %, Master's 19.8 %, Doctoral 4.7 %]
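When a pie chart is drawn by hand, each relative frequency is converted to a central angle by multiplying it by 360 degrees. The sketch below is for illustration only and computes both quantities from the same data; R's pie() does this conversion internally. The name angle is illustrative.

```r
# Earned degrees conferred in 2014 (in thousands)
freq_degree <- c(1003, 1870, 754, 178)
degree <- c("Associate's", "Bachelor's", "Master's", "Doctoral")

# Relative frequency of each category, in percent
pct <- round(100 * freq_degree / sum(freq_degree), 1)

# Central angle of each slice, in degrees
angle <- round(360 * freq_degree / sum(freq_degree))

print(data.frame(degree, freq_degree, pct, angle))
```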
1.4 Example: Death causes
In 2014, these were the leading causes of death in the United States.
• Accidents: 136,053
• Cancer: 591,699
• Chronic lower respiratory disease: 147,101
• Heart disease: 614,348
• Stroke (cerebrovascular diseases): 133,103
Construct a Pareto chart to organize the data using R.
(Source: Health, United States, 2015, Table 19)
To create a Pareto chart, we first need to create a data frame.
#Create vector for data
causes=c("Accidents", "Cancer", "CLRD", "Heart disease", "Stroke")
count=c(136053,591699,147101,614348,133103)
newcount=round(count/1000,2)
#create data frame
df <- data.frame(causes,newcount)
Next, we sort the data frame in decreasing order of frequency, because a Pareto chart arranges the bars
from tallest to shortest.
#Sort the data frame in decreasing order of newcount
new=df[order(-df$newcount),]
Then, we can create a Pareto chart as follows:
#Create a simple Pareto chart
h=barplot(new$newcount,names.arg=new$causes,las=1,ylim =c(0,700),xlab = "Cause",
ylab="Deaths (in thousands)",main="Top Five Causes of Death in the United States")
#Put value in chart
text(h,4,count[order(-count)], cex=1,pos=3)
[Figure: Pareto chart "Top Five Causes of Death in the United States"; x-axis: Cause (Heart disease, Cancer, CLRD, Accidents, Stroke); y-axis: Deaths (in thousands), 0 to 700; bars labeled 614348, 591699, 147101, 136053, 133103]
We may also generate a Pareto chart using the qcc package as follows:
#Create Pareto chart using the qcc package
#If library(qcc) gives an error, install the package first (Tools -> Install Packages -> qcc,
#or run install.packages("qcc"))
library(qcc)
names(newcount)=c("Accidents", "Cancer", "CLRD", "Heart disease", "Stroke")
pareto.chart(newcount,las=1,xlab = "Cause",ylab="Deaths (in thousands)",
main="Top Five Causes of Death in the United States")
[Figure: Pareto chart from qcc "Top Five Causes of Death in the United States"; x-axis: Cause (Heart disease, Cancer, CLRD, Accidents, Stroke); left y-axis: Deaths (in thousands), 0 to 1500; right y-axis: Cumulative Percentage, 0% to 100%]
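The cumulative percentage line in the qcc chart can be reproduced with base R. As a minimal sketch, the counts are sorted in decreasing order and accumulated with cumsum(); the names ord, sorted and cum_pct are illustrative only.

```r
# Leading causes of death in 2014
causes <- c("Accidents", "Cancer", "CLRD", "Heart disease", "Stroke")
count <- c(136053,591699,147101,614348,133103)

# Sort the counts from largest to smallest and keep the matching names
ord <- order(count, decreasing = TRUE)
sorted <- setNames(count[ord], causes[ord])

# Cumulative percentage of deaths accounted for, bar by bar
cum_pct <- round(100 * cumsum(sorted) / sum(sorted), 1)
print(cum_pct)
```

The last cumulative percentage is always 100, since the bars together account for all deaths in the table.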
1.5 Example: Fisher’s Iris data set
The British statistician Ronald Fisher introduced a famous data set called Fisher’s Iris data set. This data
set describes various physical characteristics, such as petal length and petal width, for three species of
iris. In the copy of the data that is built into R, these measurements are recorded in centimeters. The petal
lengths form the first data set and the petal widths form the second data set.
Display the data in a scatter plot using R.
(Source: Fisher, R. A., 1936)
Since Fisher’s Iris data set is built into R as the data frame iris, the scatter plot can be generated as
follows:
#Plot scatter plot using Fisher’s data set
plot(iris$Petal.Length,iris$Petal.Width,main="Fisher’s Iris Data Set",
xlab="Petal length (in centimeters)", ylab="Petal width (in centimeters)", pch=19)
[Figure: scatter plot "Fisher's Iris Data Set"; x-axis: Petal length (in centimeters), 1 to 7; y-axis: Petal width (in centimeters), 0.5 to 2.5]
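Because the data set covers three species, it can help to color the points by species. This is an optional sketch and not part of the original example; the colors and the name species_col are arbitrary choices.

```r
# One color per species, in the order of levels(iris$Species)
palette3 <- c("red", "darkgreen", "blue")
species_col <- palette3[as.integer(iris$Species)]

# Scatter plot of petal width against petal length, colored by species
plot(iris$Petal.Length, iris$Petal.Width, col = species_col, pch = 19,
     main = "Fisher's Iris Data Set",
     xlab = "Petal length (in centimeters)",
     ylab = "Petal width (in centimeters)")

# Legend that maps each color to its species
legend("topleft", legend = levels(iris$Species), col = palette3, pch = 19)
```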
1.6 Example: Motor vehicle thefts and burglaries
The table lists the number of motor vehicle thefts (in millions) and burglaries (in millions) in the United
States for the years 2005 through 2015. Construct a time series chart for the number of motor vehicle thefts
using R.
(Source: Federal Bureau of Investigation, Crime in the United States)
Let the horizontal axis represent the years and let the vertical axis represent the number of motor vehicle
thefts (in millions). Then, we can construct a time series chart as follows:
year=c(2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015)
thefts=c(1.24,1.2,1.1,0.96,0.8,0.74,0.72,0.72,0.7,0.69,0.71)
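#Plot time series chart for thefts; xaxt="n" suppresses the default horizontal axis
plot(year,thefts,type="b",pch=20,ylim=c(0,1.4),xaxt="n",main="Motor Vehicle Thefts",xlab="Year",
ylab="Thefts (in millions)")
#Draw the horizontal axis with a tick at every year
axis(1,at=year)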
[Figure: time series chart "Motor Vehicle Thefts"; x-axis: Year, 2005 to 2015; y-axis: Thefts (in millions), 0.0 to 1.2]
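To describe the pattern in a time series chart, it can help to look at the change from one year to the next. The sketch below uses diff() for this purpose; the names change, pct_total and low_year are illustrative only.

```r
# Motor vehicle thefts (in millions), 2005 through 2015
year <- c(2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015)
thefts <- c(1.24,1.2,1.1,0.96,0.8,0.74,0.72,0.72,0.7,0.69,0.71)

# Year-over-year change, labeled by the later year of each pair
change <- setNames(round(diff(thefts), 2), year[-1])
print(change)

# Overall percentage change from the first year to the last year
pct_total <- round(100 * (thefts[length(thefts)] - thefts[1]) / thefts[1], 1)
print(pct_total)

# Year with the fewest thefts
low_year <- year[which.min(thefts)]
print(low_year)
```

A run of negative changes followed by values close to zero corresponds to a chart that falls steeply and then levels off.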
2 EXERCISES
1. The data set represents the numbers of minutes a sample of 27 people exercise each week.
108 139 120 123 120 132 123 131 131
157 150 124 111 101 135 119 116 117
127 128 139 119 118 114 127 142 130
(a) Construct a frequency distribution for the data set using five classes. Include class limits, midpoints,
boundaries, frequencies, relative frequencies, and cumulative frequencies.
(b) Display the data using a frequency histogram and a frequency polygon on the same axes in R.
(c) Display the data using a relative frequency histogram in R.
(d) Display the data using an ogive in R.
(e) Display the data using a stem-and-leaf plot in R. Use one line per stem.
2. The elements with known properties can be classified as metals (57 elements), metalloids (7 elements),
halogens (5 elements), noble gases (6 elements), rare earth elements (30 elements), and other nonmetals
(7 elements). Display the data using:
(a) a pie chart
(b) a Pareto chart
3. The height (in feet) and the numbers of stories of the ten tallest buildings in New York City are listed.
Use a scatter plot to display the data. Describe any pattern.
Height (in feet) 1776 1398 1250 1200 1079 1046 1046 1005 975 952
Stories 104 96 102 58 71 77 52 75 72 66
4. The US real unemployment rates over a 12-year period are listed. Use a time series chart to display the
data. Describe any patterns.
Year 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
Rate 9.3% 8.4% 8.4% 9.2% 14.2% 16.7% 16.2% 15.2% 14.5% 12.7% 11.3% 9.9%