Data analysis
Feyera Senbeta (PhD)
The Meaning of Statistics
Several Meanings
Collections of numerical data
Summary measures calculated from a
collection of data
Activity of using and interpreting a collection of
numerical data
A Meaningful Statistic (Significant)?
Statistics, descriptive or inferential, are NOT a
substitute for good judgment
Decide what level or value of a statistic is
meaningful
State judgment before gathering and analyzing
data
Examples:
Score on performance test of 80% is passing
Pre/post rules instruction reduces incidents by
50%
Interpretation of Meaning
Population Measure (statistic)
There is no sampling error
The number you have is “real”
Judge against pre-set standard
Inferential Measure (statistic)
Tells you how sure (confident) you can be that
the number you have is real
Judge against pre-set standard and state
how certain the measure is
Statistics
Descriptive Statistics
Gives numerical and graphic procedures to summarize a collection
of data in a clear and understandable way
Inferential Statistics
Provides procedures to draw inferences about a population
from a sample
Descriptive and Inferential
Statistics
Descriptive statistics: Mathematical methods (such as
mean, median, standard deviation) that summarize and
interpret some of the properties of a set of data (sample)
but do not infer the properties of the population from
which the sample was drawn.
Inferential statistics: Mathematical methods (such as hypothesis
development) that employ probability theory for deducing
(inferring) the properties of a population from the
analysis of the properties of a set of data (sample) drawn
from it.
Did it happen by chance?
How do you know if something caused or
correlates with something else?
The appropriate Statistic will tell you:
If there is a difference from some expected value
If the difference is statistically significant or merely
due to random chance
1/24/2013 7
Descriptive Statistics
Summarize or describe the
important characteristics of a
known set of population data
Descriptive Statistics
Design                       Descriptive Statistics
Survey studies               Percentages, measures of central tendency and variation
Causal comparative studies   Measures of central tendency and variation, percentages, standard scores
Experimental                 Measures of central tendency and variation, percentages, standard scores, effect sizes
Types of descriptive statistics
Statistic is a quantitative index that describes
performance of a sample or samples
Parameter is a quantitative index describing the
performance of a population
Measures of central tendency are used to determine the
typical or average value among a group of values
Measures of variability indicate how spread out the
values are
Descriptive Statistics (Vocabulary)
Central tendency
Mode
Median
Mean
Variation
Range
Standard deviation
Normal distribution
Standard score
Correlation
Regression
Descriptive Measures
Central Tendency measures. They are
computed to give a “center” around which the
measurements in the data are distributed.
Variation or Variability measures. They
describe “data spread” or how far away the
measurements are from the center.
Relative Standing measures. They describe
the relative position of specific measurements in the
data.
Measures of Central Tendency
Mean:
Sum of all measurements divided by the number
of measurements.
Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.
Mode:
The most frequent measurement in the data.
Example of Mean
Measurements (x)    Deviation (x - mean)
3                   -1
5                   1
5                   1
1                   -3
7                   3
2                   -2
6                   2
7                   3
0                   -4
4                   0
Sum: 40             Sum: 0
MEAN = 40/10 = 4
Notice that the sum of the “deviations” is 0.
Notice that every single observation intervenes in the computation of the mean.
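The mean and its deviations can be checked with a few lines of Python. This is a minimal sketch that uses the ten measurements from the example; the variable names are illustrative only.

```python
# Measurements from the example.
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

# Mean: sum of all measurements divided by the number of measurements.
mean = sum(measurements) / len(measurements)

# Deviation of each measurement from the mean (x - mean).
deviations = [x - mean for x in measurements]

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```

Because every measurement enters the sum, changing any single value changes the mean.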
Example of Median
Measurements (x)    Measurements ranked
3                   0
5                   1
5                   2
1                   3
7                   4
2                   5
6                   5
7                   6
0                   7
4                   7
Sum: 40             Sum: 40
Median: (4+5)/2 = 4.5
Notice that only the two central values are used in the computation.
The median is not sensitive to extreme values.
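The ranking step and the median's insensitivity to extreme values can be sketched as follows. The data are the ten measurements from the example; the value 700 is a hypothetical outlier added only for illustration.

```python
import statistics

# Measurements from the example, then ranked from lowest to highest.
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]
ranked = sorted(measurements)

# With an even number of values, the median is the average of the two central ones.
n = len(ranked)
median = (ranked[n // 2 - 1] + ranked[n // 2]) / 2
print("ranked:", ranked)
print("median:", median)

# Hypothetical extreme value: replace the largest measurement with 700.
with_outlier = ranked[:-1] + [700]
print("median with outlier:", statistics.median(with_outlier))
print("mean with outlier:", statistics.mean(with_outlier))
```

The median stays where it was, while the mean is pulled far towards the extreme value.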
Example of Mode
Measurements (x): 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
In this case the data have two modes: 5 and 7
Both measurements are repeated twice
Example of Mode
Measurements (x): 3, 5, 1, 1, 4, 7, 3, 8, 3
Mode: 3
Notice that it is possible for a data set not to have any mode.
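Both mode examples can be reproduced by counting how often each value occurs. This is a sketch; the helper `modes` is a hypothetical name, and treating a data set in which every value occurs once as having no mode is an assumed convention.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # Assumed convention: if nothing repeats, the data have no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([3, 5, 5, 1, 7, 2, 6, 7, 0, 4]))  # first example: two modes
print(modes([3, 5, 1, 1, 4, 7, 3, 8, 3]))     # second example: one mode
print(modes([1, 2, 3, 4]))                    # hypothetical data with no mode
```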
Graphing data
Provides a quick view of what your data are telling
you.
There are various types of graphs which are used in
statistics including bar graphs, histograms, scatter
plots, pie charts, frequency polygons etc.
Example group of test scores
Frequency Polygon and Pie Chart
Common mistakes
Presenting one dataset as both a graph and a table
Presenting one dataset as both a frequency and a %
histogram
Which graph to use: histogram, pie chart, or
line graph
The importance of making a graph
Sample Bar Graph
Sample Histogram
Sample Scatter Plot
Frequency Distributions
Frequency distributions are like frequency
polygons; however, instead of straight lines,
a frequency distribution uses a smooth
curve to connect the points and, similar to a
graph, is plotted on two axes.
J Shaped Curve
Bimodal Curve with Two Peaks
Positively Skewed Bell Curve
Negatively Skewed Bell Curve
Symmetric Bell Curve/Normal
Distribution
What is the Normal Distribution ?
• Where did it come from and why is it so special?
• Many things you measure turn out to be normally
distributed, at least approximately.
• That is, most of the observations usually cluster
around the mean, with progressively fewer
observations out towards the extremes
Sample Histogram
Just about any histogram can be
converted into a line graph
Which can be used to plot a
normal distribution
But how do we get from the
normal to the standard normal?
Measures of variability
Range – Difference between the highest and
lowest values (high value - low value = range)
Variance ($S^2$) and Standard Deviation ($S$) –
variation of values about the mean
Measures of variation – range
Range = highest value - lowest value
Bank waiting time values:
With values of 4, 7, 7, the range is 7 - 4 = 3
With values of 1, 3, 14, the range is 14 - 1 = 13
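The two range calculations can be verified directly. This is a minimal sketch; `value_range` is a hypothetical helper name.

```python
def value_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)

# Bank waiting time values from the example.
print(value_range([4, 7, 7]))
print(value_range([1, 3, 14]))
```

Only the two extreme values enter the calculation, so the range says nothing about how the remaining values are spread.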
Other key measures of variation
$S^2$ = Variance
$S$ = Standard Deviation
Measures of variation –
standard deviation
x
6
6
6
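As a rough illustration of variance and standard deviation, the sketch below compares the values 6, 6, 6 with the two bank waiting time sets from the range example. All three sets have a mean of 6 but differ in spread. It assumes the sample formulas (dividing by n - 1), which the slides do not state explicitly; `describe_spread` is a hypothetical helper name.

```python
import statistics

def describe_spread(values):
    """Return (sample variance, sample standard deviation)."""
    # Assumption: sample formulas that divide by n - 1.
    return statistics.variance(values), statistics.stdev(values)

for values in ([6, 6, 6], [4, 7, 7], [1, 3, 14]):
    s2, s = describe_spread(values)
    print(values, "mean:", statistics.mean(values),
          "variance:", s2, "std. dev.:", round(s, 3))
```

The more the values scatter around the shared mean, the larger the standard deviation becomes.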
The Z statistic will allow you to
standardize a normal
distribution
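Standardizing is usually written as z = (x - µ) / σ. The sketch below assumes that standard formula, since the slide does not spell it out, and uses illustrative values µ = 20 and σ = 10; `z_score` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

MU, SIGMA = 20, 10  # illustrative values, assumed for this sketch

z = z_score(30, MU, SIGMA)
print("z:", z)

# The probability below x on the original scale equals the probability
# below z on the standard normal scale (mean 0, standard deviation 1).
print(NormalDist(MU, SIGMA).cdf(30))
print(NormalDist().cdf(z))
```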
Inferential Statistics
To generalize or predict how a large
group will behave based upon
information taken from a part of the
group is called INFERENCE
Techniques which tell us how much
confidence we can have when we
GENERALIZE from a sample to a
population
Inferential Statistics (Vocabulary)
Hypothesis
Null hypothesis
Alternative hypothesis
ANOVA
Level of significance
Type I error
Type II error
Collecting a random sample
Goal: to understand characteristics about a population
Examples:
What’s the average household income of 09 Kebele
residents?
What proportion of people living in Dire Dawa Town have had
malaria?
Estimating the mean
One of the most common goals of statistical
inference is estimating a population mean
with a sample mean
Central Limit Theorem
When we have n independent, identically distributed
random variables (X1, ..., Xn), the mean of those random
variables approaches a normal distribution with mean =
µ and variance = $\sigma^2 / n$ as n gets large.
Independence of random variables means that the value
of one observation has no effect on the value of another
observation.
Identical distribution of random variables means that
each random variable comes from the same population
(e.g., roll of a die, coin flip).
Simple random sampling
Each observation drawn does not depend on others
drawn
Thus observations are independent
Each observation (i.e., each random variable) is
identically distributed
The population has a distribution that doesn’t change (each
observation is randomly drawn from an identical distribution –
the distribution of the population).
So the Central Limit Theorem applies!
(when n is large)
What does this mean?
Suppose we take a sample of n = 50
observations from a population that
has this distribution:
[Figure: frequency distribution of the population; x-axis from 0 to 30, y-axis: frequency]
Mean (µ) = 20
Variance ($\sigma^2$) = 100
Std. dev ($\sigma$) = 10
We then find the mean of this sample (suppose this mean = 19). Take
another sample of 50 observations and find the mean (suppose it’s 24).
Do this many times, and we’ll come up with a distribution of means. The
Central Limit Theorem tells us this distribution will always look like the
next slide (as long as n is “large”, and 50 is large enough):
The normal curve
[Figure: normal curve of the sample mean x̄; x-axis from 16 to 24]
Mean (µ) = 20, sample size (n) = 50, variance of sample mean = $\sigma^2 / n = 100 / 50 = 2$
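The repeated-sampling idea can be tried out with a small simulation. This is a sketch under stated assumptions: the shape of the population is not recoverable from the slides, so a uniform population with mean 20 and variance 100 is assumed here, and the seed and the number of repetitions are arbitrary.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed, for reproducibility only

MU, SIGMA, N = 20, 10, 50

# Assumed population: uniform on [LOW, HIGH].
# A uniform(a, b) variable has variance (b - a)^2 / 12, so this choice gives 100.
HALF_WIDTH = SIGMA * math.sqrt(3)
LOW, HIGH = MU - HALF_WIDTH, MU + HALF_WIDTH

def sample_mean(n):
    """Draw n independent observations and return their mean."""
    return statistics.fmean(random.uniform(LOW, HIGH) for _ in range(n))

# Take many samples of size 50 and collect their means.
sample_means = [sample_mean(N) for _ in range(5000)]

mean_of_means = statistics.fmean(sample_means)
variance_of_means = statistics.variance(sample_means)
print("mean of sample means:", round(mean_of_means, 2))
print("variance of sample means:", round(variance_of_means, 2))
```

Although the assumed population is flat rather than bell-shaped, the sample means should cluster around 20 with a variance near 100/50 = 2.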
Symbols
Population Parameter: µ
Estimate: x̄
Expected: E
Basic Types of Inference
Point Inference
The value of a population parameter µ is estimated using a
single value x̄
Examples: mean, standard deviation, etc.
Interval Inference
Attaching a probability to an estimate (i.e., making a
confidence interval)
Example: we are 95% confident that µ is between 10 and 20
Judging the Quality of the Estimator
Bias – the difference between $E(\hat{\Theta})$ and $\Theta$
(i.e., Bias = $E(\hat{\Theta}) - \Theta$)
Bias may be positive or negative (e.g., a
positively biased estimator would indicate the
population parameter is higher than it actually is)
Efficiency – how clustered the distribution of $\hat{\Theta}$
is (i.e., how “peaked” its distribution is)
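Bias can be made concrete with a simulation. The sketch below is an illustration chosen for this note, not an example from the slides: it compares two estimators of a population variance, one dividing by n and one dividing by n - 1, for an assumed normal population with µ = 20 and σ² = 100. The seed, sample size, and repetition count are arbitrary.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, for reproducibility only

TRUE_VARIANCE = 100  # assumed population variance (sigma = 10)
N, REPS = 5, 20000

divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(REPS):
    sample = [random.gauss(20, 10) for _ in range(N)]
    mean = sum(sample) / N
    sum_sq = sum((x - mean) ** 2 for x in sample)
    divide_by_n.append(sum_sq / N)
    divide_by_n_minus_1.append(sum_sq / (N - 1))

# Bias = E(estimator) - true parameter, approximated by the simulation average.
bias_n = statistics.fmean(divide_by_n) - TRUE_VARIANCE
bias_n_minus_1 = statistics.fmean(divide_by_n_minus_1) - TRUE_VARIANCE
print("bias when dividing by n:", round(bias_n, 1))
print("bias when dividing by n - 1:", round(bias_n_minus_1, 1))
```

Dividing by n gives a negatively biased estimator (theory gives a bias of about -σ²/n = -20 here), while dividing by n - 1 gives a bias close to zero.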
Point Estimates (inferring population
parameters from samples)
Population Mean: $\hat{\mu} = \bar{x}$
Population Proportion: $\hat{\pi} = p = X/n$
Population Variance: $\hat{\sigma}^2 = s^2$
Population Standard Deviation: $\hat{\sigma} = s$
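Each point estimate above is a single number computed from the sample. The sketch below uses a small hypothetical sample, invented purely for illustration, and counts a "success" as a value above 15, which is also an arbitrary choice.

```python
import statistics

# Hypothetical sample, invented for illustration only.
sample = [12, 25, 8, 30, 17, 21, 5, 18]
n = len(sample)

x_bar = statistics.mean(sample)    # point estimate of the population mean
s2 = statistics.variance(sample)   # point estimate of the population variance (n - 1)
s = statistics.stdev(sample)       # point estimate of the population std. deviation

# Sample proportion p = X / n, where X counts the "successes".
successes = sum(1 for x in sample if x > 15)
p = successes / n

print("x-bar:", x_bar)
print("s^2:", round(s2, 2), "s:", round(s, 2))
print("p:", p)
```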
Confidence Intervals
The degree of confidence we have in our estimates, expressed
as a percentage
Common examples: 90, 95, or 99% confident
The confidence level is defined using the α symbol
In confidence intervals, alpha (α) is the proportion of the time
your confidence interval is wrong (i.e., does not contain the true parameter)
The typical usage is: $z_{\alpha/2}$
Why do we divide by 2?
Confidence Interval Example
What is the 95% confidence interval for a normally distributed
variable?
α = 1 - desired confidence level
α = 1 - 0.95 = 0.05
Remember that we divide α by 2 since we have uncertainty both
above and below the mean (i.e., 2 tails)
Therefore we use $z_{0.025}$ for the 95% confidence interval
From the z-table we find that $z_{0.025} = 1.96$
What does this mean?
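The z-table lookup can be reproduced with the standard library's `statistics.NormalDist`. This is a minimal sketch; `z_critical` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-tailed critical value z_(alpha/2) for a given confidence level."""
    alpha = 1 - confidence
    # alpha / 2 sits in each tail, so look up the point with 1 - alpha/2 below it.
    return NormalDist().inv_cdf(1 - alpha / 2)

for confidence in (0.90, 0.95, 0.99):
    print(confidence, round(z_critical(confidence), 3))
```

For 95% confidence this gives about 1.96: roughly 95% of a normal distribution lies within 1.96 standard deviations of its mean.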
Interval Estimation (making confidence
intervals for population parameters estimated
from samples)
Case #1 estimating an interval for µ when X is
normally distributed and we know σ
This is the simplest case because normality
allows us to use the z-table
This is also unlikely since it requires knowing the
distribution and the σ (which implies knowing µ
already)
Example #1: Create a confidence
interval for µ
A town is considering building a new bridge over a
river. The primary goal is to reduce workers’
commute times from a particular community. A
random sample of workers in that community are
asked to estimate their reduction in commute time if
the bridge were built.
Our goal is to estimate the mean reduction in
commute time for the whole community if the bridge
were built. Create a 95% confidence interval for this
mean.
Example #1 Data
n = 100 workers are sampled
x̄ = 17 minutes
σ = 30 minutes
What is the 95% confidence interval for
the mean?
Constructing a confidence interval
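Construct a 95% confidence interval around the sample mean
$$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
$$P\left(17 - 1.96\frac{30}{\sqrt{100}} \le \mu \le 17 + 1.96\frac{30}{\sqrt{100}}\right) = 0.95$$
$$P(17 - 1.96 \times 3 \le \mu \le 17 + 1.96 \times 3) = 0.95$$
$$P(17 - 5.88 \le \mu \le 17 + 5.88) = 0.95$$
So we can say that the 95% C.I. is 17 ± 5.88, or (11.12, 22.88)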
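The same interval can be computed programmatically. This is a sketch for the known-σ case only; `z_confidence_interval` is a hypothetical helper name, and the critical value comes from `statistics.NormalDist` rather than a printed z-table.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Example #1 data: n = 100 workers, sample mean 17 minutes, sigma = 30 minutes.
low, high = z_confidence_interval(17, 30, 100)
print(f"95% C.I.: ({low:.2f}, {high:.2f})")  # prints 95% C.I.: (11.12, 22.88)
```

The `confidence` and `n` arguments can be varied to see how the width of the interval responds.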
Example #1 Questions
What would happen to our interval if we
used a 99% confidence interval instead?
What would happen to our confidence
interval if we sampled 200 people instead
of 100 people?
Interval Estimation (making confidence
intervals for population parameters estimated
from samples)
Case #2 estimating an interval for µ when X is
not normally distributed and we know σ
In this case n matters a lot. Why?
This is also unlikely since it requires knowing the
distribution and the σ (which implies knowing µ
already)
Interval Estimation (making confidence
intervals for population parameters estimated
from samples)
Case #3 estimating an interval for µ when σ and
the distribution are unknown
What should we use instead of σ?
Can we use the z-table in this case?
This case is what we see most commonly
t-distribution vs. z-distribution
When we only have s (and not σ) we use the t-
distribution rather than the z-distribution
To do so we use the t-table
How are they different?
The t-distribution changes depending on the degrees of
freedom (n-1)
This is reflected in the table and in the symbol $t_{\alpha/2,\,n-1}$
The t-distribution accounts for more uncertainty (i.e., wider
confidence intervals) since s is just an estimate for σ
t-distribution vs. z-distribution
As n approaches infinity, t and z become equal
This means that even when we have s instead of σ we can use the
z-distribution if n is large
Central Limit Theorem: “…as n gets large.”
What is “large”?
Rule of thumb: 30
For n less than 30, the distribution of x̄ does not follow the normal
distribution accurately enough.
But the standardized sample mean, $(\bar{x} - \mu)/(s/\sqrt{n})$, does closely follow a
t-distribution for sample sizes of less than 30, provided the population is roughly normal.
For this class, use the t-distribution any time you have s instead of σ
Example #2
n = 16
x̄ = 30
$s^2 = 1600$
What is the 95% C.I. for the mean?
Example #2
s = 40
Degrees of freedom = n – 1 = 15
$t_{\alpha/2,\,n-1} = t_{0.05/2,\,16-1} = t_{0.025,\,15} = 2.131$ (from the t-table)
$$P\left(\bar{X} - 2.131\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + 2.131\frac{s}{\sqrt{n}}\right) = 0.95$$
$$P\left(30 - 2.131\frac{40}{\sqrt{16}} \le \mu \le 30 + 2.131\frac{40}{\sqrt{16}}\right) = 0.95$$
$$P(30 - 2.131 \times 10 \le \mu \le 30 + 2.131 \times 10) = 0.95$$
$$P(30 - 21.31 \le \mu \le 30 + 21.31) = 0.95$$
The 95% confidence interval for the mean is (8.69, 51.31)
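The t-based interval follows the same pattern as the z-based one. Python's standard library has no t-distribution quantile function, so this sketch takes the critical value 2.131 from the t-table, as the worked example does; `t_confidence_interval` is a hypothetical helper name.

```python
import math

# Critical value t_(0.025, 15), read from a t-table as in the worked example.
T_CRIT = 2.131

def t_confidence_interval(x_bar, s, n, t_crit):
    """Confidence interval for the mean when only the sample s is available."""
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Example #2 data: n = 16, sample mean 30, sample variance 1600 (so s = 40).
s = math.sqrt(1600)
low, high = t_confidence_interval(30, s, 16, T_CRIT)
print(f"95% C.I.: ({low:.2f}, {high:.2f})")  # prints 95% C.I.: (8.69, 51.31)
```

With z = 1.96 in place of 2.131 the interval would be narrower, which is the extra uncertainty the t-distribution accounts for.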
Interval Estimation (making confidence intervals
for population parameters estimated from
samples)
Case #4 estimating an interval for a proportion π
based on a sample proportion p
Remember that p = x/n
In other words, p = the number of “successes” divided by
the number of observations in the sample
For example: the proportion of people over 6 ft tall
In this case we don’t need s or σ, but we do need
the standard deviation of p:
$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}$$
Which we estimate as:
$$s_p = \sqrt{\frac{p(1-p)}{n}}$$
Interval Estimation (making confidence intervals
for population parameters estimated from
samples)
Case #4 continued
Equation:
$$p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le \pi \le p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
We use the z-distribution for estimating an interval for a
proportion π based on a sample proportion p
This also limits us to using only large samples (in this case n >
100)
For smaller samples, we calculate the entire distribution using
the binomial mass function (i.e., solve for all x values):
$$P(x) = C^{n}_{x}\,\pi^{x}(1-\pi)^{n-x}$$
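For a small sample, the binomial mass function can be evaluated for every possible x. The sketch below uses hypothetical values n = 10 and π = 0.42, chosen only for illustration; `binomial_pmf` is a hypothetical helper name.

```python
import math

def binomial_pmf(x, n, pi):
    """P(x) = C(n, x) * pi^x * (1 - pi)^(n - x)."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)

# Hypothetical small sample: n = 10 observations, assumed true proportion 0.42.
n, pi = 10, 0.42
distribution = [binomial_pmf(x, n, pi) for x in range(n + 1)]

for x, prob in enumerate(distribution):
    print(x, round(prob, 4))
print("total probability:", round(sum(distribution), 6))
```

The probabilities over all x values from 0 to n sum to 1, which is a quick check that the whole distribution has been covered.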
Example #3
n = 150 people at a convention
63 people sampled were over 6 feet tall
What is the 99% C.I. for the true
proportion of all people ≥6 ft tall at the
convention?
Example #3
p = 63/150 = 0.42
99% C.I. -> $z_{\alpha/2} = z_{0.005} = 2.58$ (from the z-table)
$$p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le \pi \le p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
$$0.42 - 2.58\sqrt{\frac{0.42 \times 0.58}{150}} \le \pi \le 0.42 + 2.58\sqrt{\frac{0.42 \times 0.58}{150}}$$
$$0.42 - 2.58 \times 0.0403 \le \pi \le 0.42 + 2.58 \times 0.0403$$
$$0.42 - 0.104 \le \pi \le 0.42 + 0.104$$
The 99% confidence interval for π, given p = 0.42, is (0.316, 0.524)
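The proportion interval can be checked in the same way as the intervals for the mean. This is a sketch for the large-sample case; `proportion_confidence_interval` is a hypothetical helper name, and the critical value is computed with `statistics.NormalDist` instead of being read from a z-table.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """Large-sample (z-based) confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Example #3 data: 63 of 150 people sampled were over 6 feet tall.
low, high = proportion_confidence_interval(63, 150, 0.99)
print(f"99% C.I.: ({low:.3f}, {high:.3f})")  # prints 99% C.I.: (0.316, 0.524)
```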