03 The Nature of Data

The Nature of Data

 Data is derived from objects, situations or phenomena.
 Data is used to classify, describe, improve or control objects, situations or phenomena.
 Data can be captured using some kind of continuous scale for differentiation. In other words, the scale used can be subdivided into meaningful increments of precision. Such data is said to be continuous.
 We can also capture data by simply counting the frequency of occurrence. This data can’t be subdivided and is therefore said to be discrete.
Copyright Route Six Sigma, LLC 2003 2

Types of Data
Processes generate two different types of data:

Continuous Data
 Also called variable data or measurement data
 This data uses some sort of measurement scale and can be subdivided into increments.

Discrete Data
 Also called countable data or attribute data (e.g. pass/fail)
 This data records the count of occurrences of something happening or not happening. Discrete data cannot be meaningfully subdivided.

Populations and Samples
Data for making decisions can be captured from two different sources: the population and a sample of the population.

Population (N)
 A set, or collection, of all possible objects or individuals of interest. This includes measurements of a specific parameter or characteristic of that set.
 A population might be all the bags of potato chips coming out of a chip factory.
 Population data is generally not available, or it is too expensive to collect and use.

Sample (n)
 A subset of a population. In statistics, we’ll deal with a “random sample”, that is, a sample chosen so that every possible sample of that size has an equal chance of being selected.
 If we select a random sample of 10 bags of potato chips from the chip factory, we need to select the sample in such a way that it could come from any of the produced bags of potato chips.

We’ll use sample data to make decisions about the population.
Continuous Data
 Measurements that can be infinitely divided on a scale or continuum.
 e.g. size, height, weight, temperature, decibels, money.

All Processes Have Variation
 Systematic variation
 Measurement differences that are expected and predictable
 We expect temperatures to be warmer in Louisville, Kentucky in the summer than in the winter
 Random variation
 Measurement differences that are not predictable
 A race horse will gallop five furlongs with a different time on different days

We expect data to vary; if it didn’t, we’d question how accurate it is. But because it varies, using data for decision making is a little more challenging.
We normally won’t use just one data point for a decision. Instead we collect multiple pieces of data, and we manage that collection to minimize variation.
Thus variation is natural and expected, and it is the foundation of statistics.
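Precise or Accurate – Which Way?
Data can be precise (small variation) but not accurate, like arrows grouped tightly together but away from the center of the target.
Or it can be accurate but lack precision (large variation).
[Figure: arrows on targets illustrating precision versus accuracy]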
The Primary Sources of Variation
 Inadequate Design Margin
 Insufficient Process Capability
 Unstable Parts and Material

[Figure: Process Capability Analysis for Camshaft Support; histogram over 598 to 602 with LSL, Target and USL marked, and Within and Overall curves]

Process Data
USL              602.000
Target           600.000
LSL              598.000
Mean             599.548
Sample N         100
StDev (Within)   0.576429
StDev (Overall)  0.620865

Potential (Within) Capability
Cp   1.16
CPU  1.42
CPL  0.90
Cpk  0.90
Cpm  0.87

Overall Capability
Pp   1.07
PPU  1.32
PPL  0.83
Ppk  0.83

Performance (PPM)   Observed    Exp. "Within"   Exp. "Overall"
PPM < LSL           10000.00    3621.06         6328.16
PPM > USL           0.00        10.51           39.19
PPM Total           10000.00    3631.57         6367.35
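
The capability indices in the camshaft support output can be reproduced from the process data. The sketch below assumes the usual textbook definitions, which the slide itself does not spell out: Cp = (USL - LSL) / 6σ, CPU = (USL - mean) / 3σ, CPL = (mean - LSL) / 3σ, and Cpk = min(CPU, CPL). The function name `capability_indices` is illustrative only.

```python
def capability_indices(mean, stdev, lsl, usl):
    """Return (Cp, CPU, CPL, Cpk) using the common textbook definitions (an assumption)."""
    cp = (usl - lsl) / (6 * stdev)    # spec width versus process width
    cpu = (usl - mean) / (3 * stdev)  # room between the mean and the upper limit
    cpl = (mean - lsl) / (3 * stdev)  # room between the mean and the lower limit
    cpk = min(cpu, cpl)               # the worse of the two sides
    return cp, cpu, cpl, cpk

# Process data taken from the camshaft support output
within = capability_indices(599.548, 0.576429, 598.0, 602.0)
overall = capability_indices(599.548, 0.620865, 598.0, 602.0)

print("Within :", [round(v, 2) for v in within])
print("Overall:", [round(v, 2) for v in overall])
```

With the within standard deviation this gives the Cp, CPU, CPL and Cpk values; with the overall standard deviation the same formulas give Pp, PPU, PPL and Ppk.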
Types of Variation
Common Cause
 Unknown or chance cause of variation inherent in any process. It is not controllable with the technology used in the process.
 It is also known as residual or background noise.
 It limits the achievable variation in the process. So, the common cause variation in a process represents the best a process can be, from a variation perspective.
 Control or improvement of common cause variation requires action on the system or process.

Special Cause
 Causes that are distinct and assignable to a specific element or input to a process.
 These causes are generally controllable with the existing technology.
 These causes will affect the variation in the process output over time.
 Special causes are often categorized by the 5 M’s and an E:
 Manpower
 Machinery
 Method
 Measurement
 Materials
 Environment

What is Statistics?
Statistics is the science of collecting data, classifying it, graphing, interpreting and analyzing that data to derive information and make decisions about the data as well as the system from which the data came.

Types of Statistics
 Descriptive statistics include statistical data that describes the present condition. Just as we describe a person we met, or a movie we have seen, we can use data to describe a process output or a defect occurrence.
 Inferential statistics deal with drawing conclusions about a population using information drawn from a sample of the population. Inferential statistics is the science that enables the news media to make predictions about election campaigns. Of course, inferential statistics have limitations, as experienced in the 2000 presidential elections.
Probability is the Link between Descriptive Statistics and Inferential Statistics
The Quincunx
 Latin: Quinque Uncia
 Five ounces
 The quincunx is a training tool
that simulates a stable
process. While your process
may have a formed or
punched part or a cycle time
as an output, the quincunx's
output is a marble falling into
numbered slots that range
from 44 to 56. Also, like most
processes, the quincunx has
adjustments for centering or
targeting the output.
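
A quincunx can be imitated in a few lines of code. The sketch below is only an illustration under stated assumptions: 12 rows of pins, an even chance of bouncing left or right at each pin, and a first slot numbered 44, so that the marble lands in slots 44 to 56. The actual device may be built differently, and the names `drop_marble` and `run_quincunx` are hypothetical.

```python
import random
from collections import Counter

def drop_marble(rng, rows=12, first_slot=44):
    """Drop one marble; each pin row shifts it right by 0 or 1 slot with equal chance."""
    return first_slot + sum(rng.random() < 0.5 for _ in range(rows))

def run_quincunx(marbles=1000, seed=0):
    """Drop many marbles and count how many land in each slot."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return Counter(drop_marble(rng) for _ in range(marbles))

counts = run_quincunx()
for slot in range(44, 57):
    # One '#' per 5 marbles gives a rough text histogram
    print(slot, "#" * (counts[slot] // 5))
```

The text histogram piles up around the middle slots and thins out toward 44 and 56, which is the bell shape a stable process tends to produce.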

Normal Distribution
 The normal distribution is defined by two parameters: its mean and
standard deviation. It is a continuous, bell-shaped distribution
which is symmetric about its mean and can take on values from
negative infinity to positive infinity.
 The equation for calculating the density is given as:

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

 Note that calculating the value of the density function requires knowledge of the value of x, the value of the mean (μ) and the value of the standard deviation (σ).
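
As a minimal sketch, the normal density can be evaluated directly from x, the mean and the standard deviation. The function name `normal_pdf` and the example values (mean 50, standard deviation 2) are illustrative assumptions, not figures from the slides.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Hypothetical process: mean 50, standard deviation 2
for x in (46.0, 48.0, 50.0, 52.0, 54.0):
    print(x, round(normal_pdf(x, 50.0, 2.0), 5))
```

The printed values are highest at the mean and identical at equal distances on either side of it, reflecting the symmetry of the distribution.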

Describing Data Distributions
 Shape
This reflects the pattern of variation: is it symmetrical around the mean, peaked, etc.?
 Location or Central Tendency
This measurement indicates the center or midpoint of the
distribution of data. The most common of these is mean or average,
but median, or middlemost point is also used. Mode is still another
indicator, but one not generally used in statistics.
 Dispersion (Spread)
This measure provides information about the variability of the data.
The most common measures of variability are data range, data
variance and data standard deviation.

Distribution Shape
 Normal - the normal distribution is one of the
most important distributions used in probability.
It is useful for describing a variety of random
processes such as student test scores or the
size of metal parts.
 Poisson - The Poisson distribution is useful to describe situations concerned with counting the number of times that a certain type of event occurs within a specified opportunity frame such as a period of time or a physical region or part.
 Binomial – The Binomial Distribution describes
the data that arises from counts or proportions
which are realizations of a discrete random
variable such as how many times an event
occurs in n repetitions of an experiment.
Measures of Central Tendency
 Average or Mean
The mean of a set of n values is simply the sum of all the values divided by n:

$$ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} $$

where X represents the name of the variable being observed and $x_i$ represents the i-th value of X in the set of data. Σ represents “sum of” and $\bar{X}$ represents the mean of the $x_i$’s. Note that μ is used in lieu of $\bar{X}$ when the mean is of the entire population.

 Median
The median is the value of the middlemost term of a distribution when the values are sorted in order. If the data set has an odd number of data points, the median is the middle term. If the number of data points is an even number, the median is the mean of the two middle values.
Example:
For the data set 5, 7, 8, 9, 12, 15, 16, the median is 9.
For the data set 5, 7, 8, 9, 12, 16, the median is 8.5 (the average of 8 and 9).

Mode
The mode is the value that appears most frequently in a set of data.
It’s not used much in statistics.
Example: Shoe sizes sold today

7.5, 8, 7.5, 10, 10.5, 11, 11.5, 10.5, 9, 10.5, 8
Count the number of times each size appears. 10.5 appears three times, more than any other size, so 10.5 is the mode.
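
The three measures of central tendency can be checked with Python's standard `statistics` module. This is a small sketch using the example data sets from these slides; the variable names are illustrative.

```python
import statistics

odd_set = [5, 7, 8, 9, 12, 15, 16]   # odd number of data points
even_set = [5, 7, 8, 9, 12, 16]      # even number of data points
shoe_sizes = [7.5, 8, 7.5, 10, 10.5, 11, 11.5, 10.5, 9, 10.5, 8]

# Mean: sum of the values divided by how many there are
print("mean:", sum(odd_set) / len(odd_set))

# Median: middle value, or the average of the two middle values
print("median (odd count):", statistics.median(odd_set))
print("median (even count):", statistics.median(even_set))

# Mode: the most frequently occurring value
print("mode:", statistics.mode(shoe_sizes))
```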

Measures of Dispersion
 Range
The simplest of dispersion measurements. The range is simply
the measured difference between the largest measurement
and the smallest.
 Deviation
The deviation is the difference between the measured value and the mean of the data set from which it is drawn:

$$ x_i - \mu \quad \text{or} \quad x_i - \bar{X} $$

 Variance
The variance is the sum of the squares of the deviations divided by the number N in the population, or by the degrees of freedom (n - 1) for a sample set:

$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1} $$
Measures of Dispersion
 Standard Deviation
 This is the most common means for measuring dispersion in statistics. The standard deviation is simply the square root of the variance. It is denoted by σ for a population and s or σ̂ (sigma hat) for a sample.

$$ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \qquad s = \hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}} $$

[Figure: Histogram of normal2, with Normal Curve; x-axis: normal2 (45 to 55), y-axis: Frequency; annotated with the mean (μ) and standard deviation (σ)]
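
The measures of dispersion can be computed step by step to show where the N and n - 1 divisors enter. The sketch below reuses the data set 5, 7, 8, 9, 12, 15, 16 from the median example purely as an illustration and treats it first as a whole population and then as a sample.

```python
import math
import statistics

data = [5, 7, 8, 9, 12, 15, 16]
n = len(data)
mean = sum(data) / n

# Range: largest measurement minus smallest
data_range = max(data) - min(data)

# Sum of squared deviations from the mean
sum_sq_dev = sum((x - mean) ** 2 for x in data)

pop_variance = sum_sq_dev / n            # divide by N: data treated as a population
sample_variance = sum_sq_dev / (n - 1)   # divide by n - 1: data treated as a sample

pop_sd = math.sqrt(pop_variance)         # standard deviation = square root of variance
sample_sd = math.sqrt(sample_variance)

print("range:", data_range)
print("population variance, sd:", pop_variance, pop_sd)
print("sample variance, sd:", sample_variance, sample_sd)
```

The standard library offers the same calculations as `statistics.pvariance`, `statistics.variance`, `statistics.pstdev` and `statistics.stdev`, which is a convenient cross-check.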

Discrete Data
 Counted data
 Number of orders processed
 Number of defects
 Product attributes or levels
 First class tickets; technical degreed people on staff
 This is commonly counted data
 Artificial scales
 Likert scales
 Good; Better; Best
 Agree; Neutral; Disagree

Binomial Distribution
 The binomial distribution is commonly associated with binary data (two possible outcomes: good/bad, pass/fail).
 The sampled trials are identical; i.e., the same trial is performed under identical conditions.
 The trials are independent: the outcome of one trial does not affect or influence the outcome of any other trial.
 The probability (p) of “success” on each trial is the same.
 The probability of x successes is

$$ P(x) = \binom{n}{x} p^x (1 - p)^{n - x} $$

where n is the number of items sampled and x is the number of items having a defect.
 Notice that the distribution is very slightly skewed.

 This distribution is the basis for the common forms of attribute control charts.
[Figure: binomial distribution plot; x-axis: binomial (0 to 10), y-axis: Density (0.0 to 0.3)]
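
The binomial formula can be evaluated directly. This is a minimal sketch; the sample size n = 10 and defect probability p = 0.3 are hypothetical values chosen for illustration, and `binomial_pmf` is an illustrative name.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # hypothetical sample size and per-item defect probability
for x in range(n + 1):
    print(x, round(binomial_pmf(x, n, p), 4))
```

Summing the probabilities over x = 0 to n gives 1, which is a quick sanity check on any implementation.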
Poisson Distribution
The Poisson distribution is most commonly associated with modeling
the number of random occurrences of some phenomenon in a specified
space or time.
To calculate the probability of x random occurrences of an event when the average number of occurrences over time is μ, the Poisson distribution is described by the equation

$$ P(x) = \frac{e^{-\mu} \mu^x}{x!} $$

where x = 0, 1, 2, … and e = 2.71828.
[Figure: Poisson distribution plot; x-axis: poisson (0 to 15), y-axis: Density (0.0 to 0.3)]
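
The Poisson probabilities follow directly from the equation. In this sketch the average of 2 occurrences per unit is a hypothetical value, and `poisson_pmf` is an illustrative name.

```python
import math

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the average number of occurrences is mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

mu = 2.0  # hypothetical average number of defects per unit
for x in range(8):
    print(x, round(poisson_pmf(x, mu), 4))
```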
Defects per Unit
 Since we are counting to accumulate data, a common thread is that the data will be a count per unit of measure.
 e.g., number of defects on a PC board.
 It is common to report this data as Defects per Unit (DPU).
 But a computer motherboard will not have as many opportunities for a defect as an Alcatel card or a Sun CPU.
 Is it accurate to report defects as DPU in both instances and compare the results?
DPU forms the foundation for Six Sigma using discrete data.
Defects per Opportunity
 To accurately compare discrete defect data from
different processes or products, it is appropriate to
include the number of opportunities for a defect to
occur in each unit as well as the number of units.
$$ \text{Defects per opportunity (DPO)} = \frac{\text{Defect count}}{\text{Units} \times \text{Opportunities per unit}} $$
 Given this relationship, what should be our goal for
the opportunities?
 Reduce the number of opportunities
 Increase the ability of each opportunity to perform without
defects
DPO is the probability of a defect on any one CTQ or step of the process.
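
The difference between DPU and DPO shows up when two products of different complexity are compared. The counts below are hypothetical, invented only to illustrate the arithmetic; `dpu` and `dpo` are illustrative names.

```python
def dpu(defects, units):
    """Defects per unit."""
    return defects / units

def dpo(defects, units, opportunities_per_unit):
    """Defects per opportunity: defects divided by the total number of opportunities."""
    return defects / (units * opportunities_per_unit)

# Hypothetical data: same defect count and unit count, different complexity
simple_board = {"defects": 20, "units": 100, "opportunities_per_unit": 50}
complex_board = {"defects": 20, "units": 100, "opportunities_per_unit": 500}

for name, b in (("simple", simple_board), ("complex", complex_board)):
    print(name,
          "DPU =", dpu(b["defects"], b["units"]),
          "DPO =", dpo(b["defects"], b["units"], b["opportunities_per_unit"]))
```

Both boards report the same DPU, but the more complex board has a DPO ten times lower, which is why DPO is the fairer basis for comparing different products.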

The Bead Box
 This bead box represents a
population of product or service
outputs. The white beads
represent good outcomes and
the colored beads represent
outcomes with some defect.
 We can use the power of statistical sampling to learn the relationship of defective outcomes to good outcomes without looking at every output in the population.
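
The bead box exercise can be mimicked in code. This sketch assumes a hypothetical box of 900 white and 100 colored beads and a sample of 50 beads; none of these numbers come from the slides.

```python
import random

rng = random.Random(42)  # fixed seed so the draw is repeatable

# Hypothetical population: white = good outcome, colored = outcome with a defect
bead_box = ["white"] * 900 + ["colored"] * 100

# Simple random sample drawn without replacement
sample = rng.sample(bead_box, 50)

true_fraction = bead_box.count("colored") / len(bead_box)
estimated_fraction = sample.count("colored") / len(sample)

print("true defective fraction:     ", true_fraction)
print("estimated from 50-bead sample:", estimated_fraction)
```

The estimate will differ from the true fraction from one sample to the next; that sample-to-sample difference is the random variation inferential statistics has to account for.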

Yield: A Common Discrete
Measure
How do you measure yield?

$$ \text{Yield} = \frac{\text{Number of units that pass}}{\text{Number of units tested}} $$

Does this yield take into account all the defects generated by your process?
What’s missing?
Would a yield measurement that looked at yield at each operational element be more accurate?

$$ \text{Yield} = \text{Yield}_{\text{Op1}} \times \text{Yield}_{\text{Op2}} \times \text{Yield}_{\text{Op3}} \times \cdots \times \text{Yield}_{\text{Opn}} $$

Rolled Throughput Yield
 What is the probability of accomplishing a process error free?

$$ P(\text{good})_{\text{Op1}} \times P(\text{good})_{\text{Op2}} \times P(\text{good})_{\text{Op3}} \times \cdots \times P(\text{good})_{\text{Opn}} $$

 Does this look familiar?
 This is known as Rolled Throughput Yield.
 Rolled throughput yield takes into account the yield at each step of completion, that is, the yield for every opportunity for a defect.
 Complexity plays a major part here.
• 4σ capability across 50 steps produces a rolled throughput yield of 0.99379^50 = 0.7324, or 73.24%.
• 4σ capability across 100 steps produces a rolled throughput yield of 0.99379^100 = 0.5364, or 53.64%.
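
The rolled throughput yield figures above can be reproduced by multiplying the per-step yields. This is a minimal sketch; the function name `rolled_throughput_yield` is illustrative, and the per-step yield of 0.99379 is the 4σ figure quoted on the slide.

```python
import math

def rolled_throughput_yield(step_yields):
    """Probability that a unit passes every step defect free: the product of the step yields."""
    return math.prod(step_yields)

per_step = 0.99379  # per-step yield at 4-sigma capability, as quoted above

print(round(rolled_throughput_yield([per_step] * 50), 4))
print(round(rolled_throughput_yield([per_step] * 100), 4))
```

Passing a list of different yields, one per operation, gives the rolled throughput yield for a process whose steps are not equally capable.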
