CHAPTER 1: DESCRIPTIVE STATISTICS
1.1 Introduction
Example 1: Making Steel Rods
Consider a machine that makes steel rods for use in optical storage
devices. The specification for the diameter of the rods is 0.45 ± 0.02
cm. The machine makes 1000 rods per hour (continuous flow
production). The engineer wants to be fairly certain that the
percentage of good rods is at least 90%; otherwise he will shut
down the process for recalibration.
Example 2: Comparison of Breaking Strength of Two Alloys
In order to compare the strength qualities of two alloys, five
specimens from each group were selected randomly and their
breaking strength (force required to rupture a specimen in a tension
test) in megapascals was measured.
The following data were obtained:
Alloy A Alloy B
404 365
406 452
396 378
392 461
402 344
[Figure: breaking strength measurements plotted separately for Alloy A and Alloy B]
Which alloy seems to be the better in terms of its strength?
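One rough way to start on this question is to compare the center and the spread of each sample. The sketch below uses the sample mean and the range, both defined in Section 1.6; the function and variable names are illustrative only.

```python
# Breaking strengths (MPa) from the table above.
alloy_a = [404, 406, 396, 392, 402]
alloy_b = [365, 452, 378, 461, 344]

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

for name, data in [("Alloy A", alloy_a), ("Alloy B", alloy_b)]:
    print(name, "mean =", mean(data), "range =", value_range(data))
```

Both samples have a mean of 400 MPa, but the range is 14 MPa for Alloy A and 117 MPa for Alloy B, so in these samples Alloy A is the more consistent of the two.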
Study components: obtaining a random sample, collecting data
(obtaining trustworthy measurements), data analysis and
conclusions (generalization from the sample to the whole
population).
1.2 Applications of Statistics in Engineering
Quality control (in manufacturing operations randomly sampling
and testing a fraction of the output; process can be corrected
before a large number of defective items is produced)
Example 3: Filling process
Filling machine fills plastic bottles with a drink; random
sampling used to control the amount of drink in bottles; filling
process can be corrected before it creates a large number of
underfilled or overfilled bottles.
Reliability (ability of a device or system to perform a required
function under stated conditions for a specified period of time;
how long a component or a system will survive)
Example 4: Fatigue Tests for Aircraft Wheels
Suppose time until failure of wheels used on commercial aircraft
needs to be estimated. The wheels are made of very strong alloys
able to support aircraft on the ground.
A machine is used to roll the wheel under the desired design load.
The times until failure for wheels randomly selected from the
production run (in thousands of km) are obtained. The data are
used to estimate time until failure for all wheels.
Example 5: Data Mining in Oil and Gas Extraction
Digitization of oil fields: surface rock and soil type, seismic data
(creating shock waves that pass through hidden rock layers and
interpreting the waves that are reflected back to the surface),
satellite images, small core samples obtained by shallow drilling.
Statistical models used to find economically viable fields.
1.3 Random Sampling
Population - the entire collection of individuals or objects about
which information is desired
Sample - the collection of individuals or objects we will actually
measure
[Diagram: Sample -> Population (generalization)]
Inferences - statements about the population based on the sample
data
Valid inferences about the population can be reached if the sample is
representative of the population.
Random sample - a sample in which the elements are chosen at
random (a random sample tends to be representative of the population).
Larger random samples tend to give more accurate results than smaller samples.
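A random sample can be drawn in software. The sketch below assumes a hypothetical population of 1000 rods labeled 1 to 1000 (one hour of output in Example 1); the sample size of 20 and the seed are illustrative choices, not part of the example.

```python
import random

# Hypothetical population: labels of the 1000 rods made in one hour.
population = list(range(1, 1001))

rng = random.Random(2024)            # fixed seed so the draw is reproducible
sample = rng.sample(population, 20)  # 20 distinct rods, chosen without replacement

print(sorted(sample))
```

Every rod has the same chance of being selected, which is what makes the sample random.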
1.4 Variables
Variable - any characteristic of a person or thing that can be
expressed as a number or a label.
Variables are of two types:
Categorical (values are labels)
Quantitative, or numerical (values are numbers)
Categorical variables: gender, hair color, marital status
Quantitative variables: weight, height, age, income
1.5 Displaying Categorical Variables
Consider a class of 30 with 18 males and 12 females.
(a) Bar Graph
A vertical bar erected over each category; the height of the bar is
the frequency or the percentage of observations in the category.
[Figure: bar graph with Percent on the vertical axis; Females 40, Males 60]
(b) Pie Chart
Slices represent categories; the size of each slice corresponds
to the percentage for the category.
[Figure: pie chart with slices for Females and Males]
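The bar heights and slice sizes for the class of 30 can be computed directly from the counts. This is a minimal sketch; the slice angle in degrees (360 times the category's share) is an added illustration, not something stated in the notes.

```python
counts = {"Females": 12, "Males": 18}
total = sum(counts.values())

# Percentage of the class in each category (bar height or slice size).
percents = {label: 100 * count / total for label, count in counts.items()}

# Pie-chart slice angle in degrees for each category.
angles = {label: 360 * count / total for label, count in counts.items()}

print(percents)  # {'Females': 40.0, 'Males': 60.0}
print(angles)    # {'Females': 144.0, 'Males': 216.0}
```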
1.6 Describing Quantitative Variables
(a) Measures of Center
Sample Mean
Suppose a sample consists of n observations $x_1, x_2, \ldots, x_n$. The
sample mean $\bar{x}$ is defined as
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$
In compact notation: $\bar{x} = \frac{\sum x_i}{n}$.
Sample mean $\bar{x}$ is an estimate of the population mean $\mu$ (the
mean of all observations in the whole population).
Example 6: 30 30 40 50 50 60 40, $\bar{x} = 300/7 \approx 42.857$
30 30 40 50 50 60 340, $\bar{x} = 600/7 \approx 85.714$
Conclusion: The mean is not a resistant measure of center (it is very
sensitive to outliers, observations that fall well below or above
the overall bulk of the data).
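The two means in Example 6 can be checked with a few lines of code. This is a minimal sketch; the function name sample_mean is illustrative.

```python
def sample_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)

original = [30, 30, 40, 50, 50, 60, 40]
with_outlier = [30, 30, 40, 50, 50, 60, 340]  # last value 40 replaced by 340

print(round(sample_mean(original), 3))      # 42.857
print(round(sample_mean(with_outlier), 3))  # 85.714
```

Changing a single observation doubles the mean, which is what "not resistant" means here.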
Sample Trimmed Mean
Delete some of the smallest and some of the largest observations
(usually bottom 10% and top 10% removed) and take the mean
of the remaining observations.
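A trimmed mean might be computed as in the sketch below. The ten data values are hypothetical, and rounding the number of trimmed observations down to a whole number is an assumption; the notes do not say how to handle a fractional count.

```python
def trimmed_mean(xs, fraction=0.10):
    """Mean after dropping `fraction` of the observations from each end."""
    ordered = sorted(xs)
    k = int(len(ordered) * fraction)  # assumption: round the count down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical sample of 10 observations with one large outlier.
data = [30, 30, 40, 40, 50, 50, 60, 60, 70, 340]

print(sum(data) / len(data))  # ordinary mean: 77.0
print(trimmed_mean(data))     # 10% trimmed mean: 50.0
```

With 10 observations, 10% trimming removes one value from each end, so the outlier 340 no longer affects the result.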
The Sample Median
To compute the median:
(i) Arrange all observations in order, from smallest to largest.
(ii) Median = the single middle value if n is odd;
Median = the average of the two middle values if n is even.
Example 7:
Data set 1: 30 60 40 30 50 40 50 Median = 40
Ordered list: 30 30 40 40 50 50 60
Data set 2: 30 60 40 30 50 48 50 40 Median = (40+48)/2 = 44
Ordered list: 30 30 40 40 48 50 50 60
The median remains the same if 60 is replaced by 600.
Conclusion: The median is a resistant measure of center.
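The two-step rule for the median translates directly into code. The sketch below applies it to the data sets of Example 7 and to the same data with 60 replaced by 600; the function name is illustrative.

```python
def sample_median(xs):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

set1 = [30, 60, 40, 30, 50, 40, 50]
set2 = [30, 60, 40, 30, 50, 48, 50, 40]

print(sample_median(set1))  # 40
print(sample_median(set2))  # 44.0

# Replace 60 by 600: the medians do not change.
print(sample_median([600 if x == 60 else x for x in set1]))  # 40
print(sample_median([600 if x == 60 else x for x in set2]))  # 44.0
```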
Sample Mode
The sample mode is the most frequently occurring observation in
the sample (no mode if the observations occur with the same
frequency).
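A possible way to find the mode is to count how often each value occurs. In the sketch below, returning every value tied for the highest count is an assumption (the notes do not say how to treat ties), and the first two data sets are hypothetical.

```python
from collections import Counter

def sample_modes(xs):
    """Most frequent value(s); empty list if all values are equally frequent."""
    counts = Counter(xs)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []  # every value occurs equally often: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(sample_modes([30, 40, 40, 50, 60]))  # [40]
print(sample_modes([30, 40, 50]))          # []
print(sample_modes([30, 60, 40, 30, 50, 40, 50]))  # [30, 40, 50]
```

The last call uses Data set 1 of Example 7, where three values tie for the highest frequency.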
(b) Measures of Spread
Sample Range
Range = Largest - Smallest.
Range ignores all of the information between the largest and the
smallest values.
Variance and Standard Deviation
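[Figure: number line showing the deviation $x_1 - \bar{x}$ of an observation from the mean]
Observations: $x_1, x_2, \ldots, x_n$
Variance $s^2$ is defined as
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}.$$
Compact notation: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$.
Standard deviation s: $s = \sqrt{s^2}$.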
Properties of s:
1. Measures the spread of observations about the mean.
2. s is not resistant to outliers.
3. s=0 only when there is no spread (all observations equal).
4. s is an estimate of the population standard deviation $\sigma$
(standard deviation of all observations in the population).
Example 8: Compute the variance and standard deviation of the
observations: 20, 40, 50, 30, 60, 70
Solution:
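One way to work through the solution is to follow the definition step by step: compute the mean, square each deviation from it, and divide the total by n - 1. A minimal sketch:

```python
import math

data = [20, 40, 50, 30, 60, 70]
n = len(data)

mean = sum(data) / n                            # 270 / 6 = 45.0
squared_devs = [(x - mean) ** 2 for x in data]  # 625, 25, 25, 225, 225, 625
variance = sum(squared_devs) / (n - 1)          # 1750 / 5 = 350.0
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev, 3))        # 45.0 350.0 18.708
```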
Equivalent formula for the variance:
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right].$$
Example 9: Use the above formula to recalculate the standard
deviation for the above data.
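The equivalent formula needs only two running totals, the sum of the observations and the sum of their squares. A minimal sketch for the data 20, 40, 50, 30, 60, 70:

```python
import math

data = [20, 40, 50, 30, 60, 70]
n = len(data)

sum_x = sum(data)                   # 270
sum_x2 = sum(x * x for x in data)   # 13900

variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)  # (13900 - 12150) / 5 = 350.0
std_dev = math.sqrt(variance)

print(variance, round(std_dev, 3))  # 350.0 18.708
```

The variance and standard deviation agree with the values obtained from the definition.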
Sample Quantiles
The pth sample quantile is a value such that p percent of the
observations fall at or below that value.
Three useful quantiles are the quartiles. The lower (or first) quartile has
p=25, the median (or second quartile) has p=50, and
the upper (or third) quartile has p=75.
They are denoted by Q1, Q2, and Q3, respectively.
[Figure: ordered sample split at the median M = Q2 into a LOWER HALF and an UPPER HALF, shown for n even and for n odd]
Q1 = Median of the Lower Half
Q2 = Overall Median,
Q3 = Median of the Upper Half.
Interquartile range IQR: IQR = Q3 - Q1.
IQR is a measure of spread in the data.
Example 10: Obtain the quartiles and IQR for the sample:
30 30 40 40 48 50 50 60 66 86 94 112
Solution:
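The quartiles can be found by splitting the ordered sample into halves and taking the median of each half. In the sketch below, leaving the overall median out of both halves when n is odd is an assumption; it does not matter here because n = 12 is even.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(xs):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the overall median belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

sample = [30, 30, 40, 40, 48, 50, 50, 60, 66, 86, 94, 112]
q1, q2, q3 = quartiles(sample)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 40.0 50.0 76.0 36.0
```

The lower half is 30 30 40 40 48 50 (median 40) and the upper half is 50 60 66 86 94 112 (median 76), so IQR = 76 - 40 = 36.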
1.7 Outliers
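Outliers - observations separated from the main body of the data.
Outlier - an observation more than 1.5*IQR below Q1 or more than 1.5*IQR above Q3.
[Figure: number line marking Q1, Q2, Q3 and the distance 1.5 IQR on either side, with a point beyond it labeled as an outlier]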
Example 11: Are there any outliers in Example 10?
Solution:
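The check can be sketched as follows. The values Q1 = 40 and Q3 = 76 are taken as the medians of the lower half (30 30 40 40 48 50) and the upper half (50 60 66 86 94 112) of the sample in Example 10; the names lower_fence and upper_fence are illustrative.

```python
sample = [30, 30, 40, 40, 48, 50, 50, 60, 66, 86, 94, 112]

q1 = 40.0  # median of the lower half
q3 = 76.0  # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # 40 - 54 = -14.0
upper_fence = q3 + 1.5 * iqr  # 76 + 54 = 130.0

outliers = [x for x in sample if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)  # -14.0 130.0 []
```

All observations lie between -14 and 130, so the sample has no outliers under the 1.5*IQR rule.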
1.8 Displaying Quantitative Variables
Example 12: 30 examination scores:
75 79 58 73 82 94
61 77 54 77 65 67
62 61 64 45 58 86
66 83 70 91 48 78
86 66 52 80 59 55
(a) Histograms
1. Divide the range of the data into non-overlapping classes of
equal width.
Convention: the right-hand limit of each class is included, the
left-hand limit is excluded (Excel); that is, each class is an
interval of the form (a, b].
2. Count the number of observations (frequency) in each class.
3. Erect over each class a rectangle whose height equals the
frequency of that class.
Frequency Table:
Class Interval   Frequency   Relative Frequency
40-50            2           2/30
50-60            6           6/30
60-70            9           9/30
70-80            7           7/30
80-90            4           4/30
90-100           2           2/30
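The frequency table can be reproduced by counting the scores in each class. The sketch below follows the stated convention (right-hand limit included, left-hand limit excluded); the variable names are illustrative.

```python
scores = [75, 79, 58, 73, 82, 94,
          61, 77, 54, 77, 65, 67,
          62, 61, 64, 45, 58, 86,
          66, 83, 70, 91, 48, 78,
          86, 66, 52, 80, 59, 55]

edges = [40, 50, 60, 70, 80, 90, 100]  # class limits
n = len(scores)

frequencies = []
for lo, hi in zip(edges, edges[1:]):
    # A score belongs to the class if lo < score <= hi.
    freq = sum(1 for s in scores if lo < s <= hi)
    frequencies.append(freq)
    print(f"{lo}-{hi}: frequency {freq}, relative frequency {freq}/{n}")

print(frequencies)  # [2, 6, 9, 7, 4, 2]
```

Under this convention a score of 70 is counted in the class 60-70 and a score of 80 in the class 70-80.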
[Figure: Frequency histogram for the 30 scores]
[Figure: Relative frequency histogram of the 30 scores]
Shapes of histograms
Unimodal (one peak), bimodal (two peaks)
Symmetric or skewed (skewed right or skewed left)
[Figures: example frequency histograms labeled Symmetric, Skewed Left, and Skewed Right]
(b) Boxplots
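[Figure: boxplot with the following parts labeled, from top to bottom]
Outlier (more than 1.5 IQR above Q3)
Upper whisker end: the largest observation within 1.5 IQR from Q3
Box: Q3 (top), Q2 (line inside the box), Q1 (bottom); the height of the box is the IQR
Lower whisker end: the smallest observation within 1.5 IQR from Q1
Outlier (more than 1.5 IQR below Q1)
[Figure: boxplots of skewed right, skewed left, and symmetric data]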
For a symmetric unimodal distribution: Mean = Median = Mode.
Example 13: Obtain the boxplot for the 30 exam scores. Repeat the
exercise with the score 94 replaced by 120.
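The numbers needed to draw the boxplot can be computed as in the sketch below. It uses the median-of-halves rule for the quartiles, with the assumption that the overall median is left out of both halves when n is odd (not relevant here, since n = 30). The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def boxplot_summary(xs):
    """Quartiles, whisker ends, and outliers under the 1.5*IQR rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the overall median belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    q1, q2, q3 = median(lower), median(ordered), median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "Q1": q1, "Q2": q2, "Q3": q3,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

scores = [75, 79, 58, 73, 82, 94, 61, 77, 54, 77, 65, 67,
          62, 61, 64, 45, 58, 86, 66, 83, 70, 91, 48, 78,
          86, 66, 52, 80, 59, 55]
modified = [120 if s == 94 else s for s in scores]  # 94 replaced by 120

print(boxplot_summary(scores))
print(boxplot_summary(modified))
```

For the original scores Q1 = 59, Q2 = 66.5, Q3 = 79, the whiskers end at 45 and 94, and there are no outliers. With 94 replaced by 120 the quartiles are unchanged, but 120 exceeds 79 + 1.5(20) = 109, so it is plotted as an outlier and the upper whisker ends at 91.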
(c) Scatterplots
Scatterplots are used to display a relationship between two
numerical variables.
[Figure: scatterplot of Sales against Price]
(d) Time Series Plots (line charts)
[Figure: time series plot with the variable on the vertical axis and equally spaced time intervals on the horizontal axis]
Example 14: Lumber Cutting
An operator cuts 2-by-4 lumber into exactly 96-inch lengths using a
table saw. However, few pieces will be exactly 96 inches long.
Sources of variation in the cut lengths: most saw blades wobble,
lumber is at least slightly warped, cuts become less precise as the
saw blade becomes duller. The lengths (in inches) of 20 cuts are
given below:
Order  Length    Order  Length
1      95.99     11     96.01
2      96.00     12     95.96
3      95.99     13     96.01
4      96.00     14     96.02
5      95.98     15     95.95
6      96.00     16     96.04
7      95.98     17     96.02
8      96.00     18     96.07
9      95.97     19     96.03
10     96.03     20     96.05
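A rough text version of the time series plot, together with a comparison of the spread in the earlier and later cuts, might look like the sketch below. Splitting the series after cut 10 is an arbitrary illustrative choice, and the text plot only stands in for a proper line chart.

```python
import statistics

lengths = [95.99, 96.00, 95.99, 96.00, 95.98, 96.00, 95.98, 96.00, 95.97, 96.03,
           96.01, 95.96, 96.01, 96.02, 95.95, 96.04, 96.02, 96.07, 96.03, 96.05]
target = 96.0

# Crude time series plot: one row per cut, the star's position shows
# the deviation from 96 inches in hundredths of an inch.
for order, length in enumerate(lengths, start=1):
    offset = round((length - target) * 100)
    print(f"{order:2d} {length:6.2f} " + " " * (10 + offset) + "*")

early, late = lengths[:10], lengths[10:]
sd_early = statistics.stdev(early)  # sample standard deviation (divisor n - 1)
sd_late = statistics.stdev(late)
print(round(sd_early, 4), round(sd_late, 4))
```

In these data the standard deviation of the last ten cuts is larger than that of the first ten, which is consistent with cuts becoming less precise as the blade dulls, although twenty cuts are too few to be conclusive.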