1 Summarizing Data
(Ch 1.1, 1.3, 1.10-1.13, 2.4.3, 2.5)
Populations and Samples
A statistical study is an investigation of some characteristic of a
population of interest.
Example: You want to study the average GPA of juniors
who are engineering majors.
Population:
All engineering majors who are juniors.
Characteristic of interest:
Average GPA.
2
Populations and Samples
What statisticians need to do:
1) Learn about the distribution of the characteristic in
the population.
2) Do this by taking a sample from the population.
Why?
3) Summarize the sample with sample statistics.
3
Graphics: Histograms
A histogram is a graphical representation of the
distribution of numerical data.
Construct a histogram:
1. “Bin” the range of values. (The bins are usually
consecutive and non-overlapping, and usually of equal
width.)
2. Frequency histogram: count how many values fall into
each bin/interval and draw each bar with height equal to its count.
3. Density histogram: count how many values fall into
each bin, and set each bar's height to (relative frequency) / (bin width),
so that the areas of the bars sum to 1.
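The construction steps above can be sketched in a few lines of Python. This is only an illustration: the function name `histogram`, the sample values, and the bin edges are made up for the example, and the code assumes every value falls inside the given edges.

```python
def histogram(values, edges):
    """Return (counts, densities) for the bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    n = len(values)
    # Density height = relative frequency / bin width, so the area of each
    # bar equals the fraction of observations that fall in that bin.
    densities = [c / n / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]
    return counts, densities


sample = [1, 2, 2, 3, 5, 6, 6, 7, 8, 11]  # hypothetical data
edges = [0, 4, 8, 12]                      # three bins of width 4

counts, densities = histogram(sample, edges)
total_area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities))

print(counts)      # heights of the frequency histogram
print(densities)   # heights of the density histogram
print(total_area)  # should be 1 up to rounding
```

The frequency and density histograms have the same shape when all bins have equal width; only the vertical scale differs.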
4
Graphics: Histograms
Examples:
- Drawing a frequency histogram by hand.
- Drawing a density histogram by hand.
5
Example
Charity is a big business in the United States. The Web site
charitynavigator.com gives information on roughly 5500
charitable organizations.
Some charities operate very efficiently, with fundraising and
administrative expenses that are only a small percentage of
total expenses, whereas others spend a high percentage of
what they take in on such activities.
6
Example cont’d
Here are the data on fundraising expenses as a percentage
of total expenditures for a random sample of 60 charities:
6.1 12.6 34.7 1.6 18.8 2.2 3.0 2.2 5.6 3.8
2.2 3.1 1.3 1.1 14.1 4.0 21.0 6.1 1.3 20.4
7.5 3.9 10.1 8.1 19.5 5.2 12.0 15.8 10.4 5.2
6.4 10.8 83.1 3.6 6.2 6.3 16.3 12.7 1.3 0.8
8.8 5.1 3.7 26.3 6.0 48.0 8.2 11.7 7.2 3.9
15.3 16.6 8.8 12.0 4.7 14.7 6.4 17.0 2.5 16.2
7
Example cont’d
We can see that a substantial majority of the charities in
the sample spend less than 20% on fundraising:
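The claim can be checked directly from the 60 values. The sketch below counts the charities under 20% and prints a rough text histogram; the 10-point bin width is an arbitrary choice made for this illustration, not necessarily the one used on the original slide.

```python
expenses = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

# Share of charities spending less than 20% on fundraising.
below_20 = [x for x in expenses if x < 20]
share = len(below_20) / len(expenses)
print(len(below_20), "of", len(expenses), "charities are below 20%")

# Frequency counts for bins [0, 10), [10, 20), ... (assumed bin width: 10).
bins = {}
for x in expenses:
    lower = int(x // 10) * 10
    bins[lower] = bins.get(lower, 0) + 1

for lower in sorted(bins):
    print(f"{lower:2d}-{lower + 10:2d}%: {'*' * bins[lower]}")
```

Here 54 of the 60 charities (90%) fall below 20%, and the long run of nearly empty bins on the right reflects a few charities with very high fundraising percentages.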
8
Graphics: Histograms
Histograms come in a variety of shapes.
• Unimodal histogram: single peak
• Bimodal histogram: two different peaks
• Multimodal histogram: many different peaks
Bimodality: Can occur when the data set consists of
observations on two quite different kinds of individuals or
objects.
Multimodality
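Symmetric histograms: the left half is a mirror image of the right half.
Positively skewed histograms: the right (upper) tail is stretched out compared with the left tail.
Negatively skewed histograms: the left (lower) tail is stretched out compared with the right tail.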
9
Sample Statistics
• Histograms and other visual summaries of samples are
excellent tools for informal learning about population
characteristics.
• The calculation and interpretation of certain summarizing
numbers are required for a deeper understanding of the
data.
• These sample numerical summaries are called “sample
statistics.”
10
Sample Statistics: Measures of Centrality
A popular and important way to summarize a set of numbers
is to describe its center.
Three popular measures of center:
1. Mean
2. Median
3. Mode
11
The Sample Mean
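For a given set of numbers $x_1, x_2, \ldots, x_n$, the most
familiar measure of the center is the mean
(arithmetic average).
Sample mean $\bar{x}$ of observations $x_1, x_2, \ldots, x_n$:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$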
Disadvantage?
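One commonly cited answer to the question above is that the mean is sensitive to extreme values. A small sketch with made-up numbers (the helper name `sample_mean` is only for illustration):

```python
def sample_mean(xs):
    """Arithmetic average: the sum of the observations divided by n."""
    return sum(xs) / len(xs)


typical = [2, 3, 4, 5, 6]        # hypothetical sample
with_outlier = [2, 3, 4, 5, 60]  # same sample with one extreme value

print(sample_mean(typical))
print(sample_mean(with_outlier))
```

Changing a single observation moves the mean from 4 to 14.8, a value larger than four of the five observations.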
13
The Sample Median
Median: Middle value when observations are ordered
smallest to largest.
To calculate: Order the n observations smallest to largest
(repeated values included) and find the middle one. If n is even,
average the two middle values.
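A minimal sketch of this calculation, assuming the usual convention that the two middle values are averaged when n is even. The function name and data are illustrative only.

```python
def sample_median(xs):
    """Middle value of the ordered observations."""
    ordered = sorted(xs)  # repeated values are kept
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle two


print(sample_median([7, 1, 3, 3, 9]))  # ordered: 1 3 3 7 9
print(sample_median([7, 1, 3, 9]))     # ordered: 1 3 7 9
```

The first call returns 3 and the second returns 5.0, the average of 3 and 7.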
15
The Mean vs. the Median
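The population mean $\mu$ and the population median $\tilde{\mu}$ will not generally be
identical. If the population distribution is positively or
negatively skewed, as pictured below, then $\mu \neq \tilde{\mu}$: the mean is
pulled toward the longer tail, so typically $\mu < \tilde{\mu}$ under negative skew
and $\mu > \tilde{\mu}$ under positive skew.
Figure: Three different shapes for a population distribution: (a) negative skew, (b) symmetric, (c) positive skew.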
Which population characteristic is most important?
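The same effect shows up in sample data. As a sketch, the 60 fundraising percentages from the charity example can be summarized with Python's standard `statistics` module:

```python
from statistics import mean, median

expenses = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

xbar = mean(expenses)   # sample mean
med = median(expenses)  # sample median

print(round(xbar, 2), round(med, 2))
```

The sample mean is about 10.9 while the sample median is 6.8: the few charities with very large percentages pull the mean upward, as expected for a positively skewed sample.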
16
Other Sample Measures
• Quartiles: divide the data set into four equal parts (how
is this calculated?)
• Percentiles: A data set can be even more finely
divided. What does “percentile” mean?
Example calculations of the median and quartiles.
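Several conventions exist for computing quartiles, and software packages do not all agree. The sketch below uses one simple convention as an assumption: the first and third quartiles are the medians of the smallest and largest halves of the ordered data, with the middle value left out of both halves when n is odd. The function name and data are made up for the illustration.

```python
from statistics import median


def quartiles(xs):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]      # smallest half (middle value excluded if n is odd)
    upper = ordered[n - half:]  # largest half
    return median(lower), median(ordered), median(upper)


data = [1, 3, 4, 6, 8, 9, 12, 15]  # hypothetical sample, n = 8
print(quartiles(data))
```

For this sample the lower half is 1, 3, 4, 6 and the upper half is 8, 9, 12, 15, giving quartiles 3.5 and 10.5 around a median of 7.0.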
18
Graphics: Boxplots
A boxplot is a convenient way of graphically depicting
groups of numerical data through the five-number
summary: minimum, first quartile, median, third quartile,
and maximum.
Example: Drawing a boxplot by hand.
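Before drawing a boxplot, the five numbers have to be computed. As a sketch, here is the five-number summary of the 60 charity fundraising percentages, assuming the quartiles are taken as the medians of the lower and upper halves of the ordered data (other conventions give slightly different quartiles):

```python
from statistics import median

expenses = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

ordered = sorted(expenses)
half = len(ordered) // 2

five_number = (
    ordered[0],                             # minimum
    median(ordered[:half]),                 # first quartile (median of lower half)
    median(ordered),                        # median
    median(ordered[len(ordered) - half:]),  # third quartile (median of upper half)
    ordered[-1],                            # maximum
)
print([round(v, 2) for v in five_number])
```

The summary is roughly 0.8, 3.85, 6.8, 14.4, 83.1: the box would span 3.85 to 14.4, and the maximum lies far above the rest of the data.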
19
Variability
So far, we’ve learned techniques for visualizing our data
and for measuring its center. What about how spread out
the data are?
Samples with identical measures of center but different amounts of variability
20
Variability
Simplest measure of variability: The range.
Samples with identical measures of center but different amounts of variability
What are the disadvantages of the range?
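One disadvantage that is often pointed out is that the range depends only on the two most extreme observations and ignores everything in between. A sketch with two made-up samples:

```python
def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)


tight = [10, 29, 30, 31, 50]  # hypothetical: middle values close to the center
loose = [10, 15, 30, 45, 50]  # hypothetical: middle values far from the center

for name, xs in [("tight", tight), ("loose", loose)]:
    print(name, "mean:", sum(xs) / len(xs), "range:", sample_range(xs))
```

Both samples have mean 30 and range 40, even though the middle observations are much more spread out in the second sample.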
22
Variability
A measure of variation that uses every observation takes into account
the deviations from the mean, $x_i - \bar{x}$.
Can we combine the deviations into a single quantity by
finding the average deviation?
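A quick numerical check suggests why the plain average deviation is not useful: the positive and negative deviations from the mean always cancel. The data below are made up for the illustration.

```python
data = [2, 4, 4, 6, 9]  # hypothetical sample
xbar = sum(data) / len(data)

deviations = [x - xbar for x in data]
print(deviations)
print(sum(deviations))  # zero, up to rounding, for any sample

# Squaring removes the signs, so the deviations no longer cancel.
squared = [d ** 2 for d in deviations]
print(sum(squared))
```

Because the deviations sum to zero, their average is zero for every sample; the squared deviations sum to 28 here and can be combined into a meaningful measure of spread.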
23
Variability
The sample variance, denoted by $s^2$, is given by
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The sample standard deviation, denoted by $s$, is the
(positive) square root of the variance:
$$s = \sqrt{s^2}$$
Note that $s^2$ and $s$ are both nonnegative. The unit for $s$ is
the same as the unit for each of the $x_i$.
Example: Calculation of the SD.
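A sketch of the calculation, assuming the usual definition of the sample variance with divisor n - 1 (sum of squared deviations from the sample mean, divided by n - 1). The data are made up, and the result is compared with Python's standard `statistics` module as a sanity check.

```python
import math
import statistics

data = [2, 4, 4, 6, 9]  # hypothetical sample
n = len(data)
xbar = sum(data) / n

# Sum of squared deviations from the sample mean, divided by n - 1.
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)

print(s2, round(s, 4))
print(statistics.variance(data), round(statistics.stdev(data), 4))
```

For this sample the squared deviations sum to 28, so the variance is 28 / 4 = 7 and the standard deviation is the square root of 7, about 2.65, in the same units as the data.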
24
Summarizing Data in R
- Summary statistics
- Graphics (boxplots, histograms)
25