Summarizing Data


Summary measures

Samuel D[MPH in Epidemiology & Biostatistics]


Learning objectives

 At the end of this chapter, the student will be able to:

1. Identify the different methods of data summarization

2. Compute appropriate summary values for a set of data

3. Appreciate the properties and limitations of summary values

Measures of Central Tendency
 One type of measure useful for summarizing data defines the center, or middle, of the sample. Such a measure is called a measure of central tendency (location).

 Notation: Σ, read as "sigma" (the Greek capital letter for S), means "the sum of".

 Suppose n values of a variable are denoted X_1, X_2, X_3, …, X_n. Then $\sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + \dots + X_n$, where the subscript i ranges from 1 up to n.

Introduction
Example:
 Let X_1 = 2, X_2 = 5, X_3 = 1, X_4 = 4, X_5 = 10, X_6 = −5, X_7 = 8
 Since there are 7 observations, i ranges from 1 up to 7
   i) $\sum x_i = 2 + 5 + 1 + 4 + 10 - 5 + 8 = 25$
   ii) $(\sum x_i)^2 = 25^2 = 625$
   iii) $\sum x_i^2 = 4 + 25 + 1 + 16 + 100 + 25 + 64 = 235$

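The three quantities in the example are easy to confuse, so a short Python sketch may help. The variable names are illustrative only and do not come from the slides.

```python
# Observations X_1..X_7 from the example.
x = [2, 5, 1, 4, 10, -5, 8]

sum_x = sum(x)                            # Σx_i: add the values
square_of_sum = sum_x ** 2                # (Σx_i)^2: add first, then square
sum_of_squares = sum(v ** 2 for v in x)   # Σx_i^2: square first, then add

print(sum_x, square_of_sum, sum_of_squares)  # 25 625 235
```

Note that (Σx_i)^2 and Σx_i^2 are different quantities: here 625 versus 235.
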
1. Arithmetic Mean
 One measure of central location for a sample is the arithmetic mean; it is usually denoted by X̅.

Definition:
 The arithmetic mean is the sum of all observations divided by the number of observations. In statistical notation:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Example:
 Suppose the sample consists of the birth weights (in grams) of all live-born infants delivered at a private hospital in a city during a 1-week period:

 2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146

 What is the arithmetic mean of the sample birth weights?

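 Solution

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{2069 + 2581 + \dots + 4146}{20} = \frac{63338}{20} = 3166.9 \text{ g}$$
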
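As a check on the arithmetic, the mean of the birth weight sample can be computed with a few lines of Python. This is a minimal sketch; the variable names are illustrative.

```python
# Birth weights (g) of the 20 infants in the example.
birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

n = len(birth_weights)
total = sum(birth_weights)
mean_weight = total / n        # sum of observations / number of observations

print(n, total, mean_weight)   # 20 63338 3166.9
```
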
 The arithmetic mean is, in general, a very natural measure of central location. One of its principal limitations, however, is that it is overly sensitive to extreme values.

 When a sample contains a few extreme values, the mean may not be representative of the location of the great majority of the sample points.

 In such samples the arithmetic mean is a poor measure of central location, since it does not reflect the center of the sample.

 Nevertheless, the arithmetic mean is by far the most widely used measure of central location.

Characteristics of Mean
 Uniqueness (for a given set of data there is one and only one arithmetic mean).

 Simplicity (the arithmetic mean is simple to understand and easy to compute).

 It can only be used with quantitative data.

 The mean will always exist, but it might not be an actual data value.

 Every data value has an effect on the mean, unlike the median and the mode.

 It can be more sensitive to extreme data values than the median.

 The sum of the deviations about it is zero.

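Two of these properties can be checked numerically. The sketch below reuses the seven observations from the summation example and adds one hypothetical extreme value (1000), which is an assumption made purely for illustration. Exact fractions are used so that the deviations sum to exactly zero rather than to a rounding residue.

```python
from fractions import Fraction
from statistics import median

# Observations from the summation example; Fractions keep the arithmetic exact.
x = [Fraction(v) for v in (2, 5, 1, 4, 10, -5, 8)]
mean_x = sum(x) / len(x)                     # 25/7

# The deviations about the mean sum to zero.
deviation_sum = sum(v - mean_x for v in x)
print(deviation_sum)                         # 0

# Hypothetical extreme value, added for illustration only.
with_extreme = x + [Fraction(1000)]
mean_extreme = sum(with_extreme) / len(with_extreme)

print(float(mean_x), float(mean_extreme))    # the mean jumps from about 3.57 to 128.125
print(median(x), median(with_extreme))       # the median moves only from 4 to 9/2
```
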
Mean for a grouped data
 This method applies when the entire range of values or scores of the variable has been divided into equal or unequal class intervals and the observations have been grouped into a frequency distribution on that basis.

 The value or score (x) of each observation is assumed to be identical to the midpoint (X_c) of the class interval to which it belongs.

 In such cases the mean of the distribution is computed as:

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_{c_i}}{\sum_{i=1}^{k} f_i}$$

where f_i is the frequency and X_{c_i} the midpoint of the i-th of the k class intervals.

Example
Time (hours)   Frequency (A)   Midpoint (B)   A × B
10-14           8              12               96
15-19          28              17              476
20-24          27              22              594
25-29          12              27              324
30-34           4              32              128
35-39           1              37               37
Total          80                             1655

Therefore, the mean is 1655/80 ≈ 20.7 hours.

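The grouped-mean calculation in the table can be sketched in Python as follows. The tuple layout (lower limit, upper limit, frequency) is an assumption chosen for this illustration.

```python
# Each class: (lower limit, upper limit, frequency), taken from the table.
classes = [(10, 14, 8), (15, 19, 28), (20, 24, 27),
           (25, 29, 12), (30, 34, 4), (35, 39, 1)]

n = sum(f for _, _, f in classes)                          # total frequency
# Every observation is treated as if it sat at its class midpoint.
weighted_total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
grouped_mean = weighted_total / n

print(n, weighted_total, round(grouped_mean, 1))           # 80 1655.0 20.7
```
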
Median

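 An alternative measure of central location, perhaps second in popularity to the arithmetic mean, is the median.

 Definition: with the n observations ordered from smallest to largest, the sample median is
   1. the ((n+1)/2)th largest observation if n is odd;
   2. the average of the (n/2)th and (n/2 + 1)th largest observations if n is even.

 The rationale for these definitions is to ensure an equal number of sample points on both sides of the sample median.
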
Median Cont…..
Example:
 Compute the sample median for the birth weight data.

 First arrange the data in ascending order:

 2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146

 Since n = 20 is even, the median is the average of the 10th and 11th largest observations:

 Median = (3245 + 3248)/2 = 3246.5 g

Median Cont…..

Example:
 Consider the following data, which consist of white blood counts taken on admission of all patients entering a small hospital on a given day.

 Compute the median white blood count (×10³).

7, 35, 5, 9, 8, 3, 10, 12, 8

Median Cont…
Solution:
 First, order the sample as follows: 3, 5, 7, 8, 8, 9, 10, 12, 35.

 Since n = 9 is odd, the sample median is the ((9+1)/2)th = 5th largest point, which is equal to 8.

 The principal strength of the sample median is that it is insensitive to very large or very small values.

 The principal weakness of the sample median is that it is determined mainly by the middle points in a sample and is less sensitive to the actual numerical values of the remaining data points.

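The odd and even cases of the median can be sketched in one small Python function. The function name `sample_median` is illustrative and not from the slides; it is applied here to the birth weight data and the white blood count data.

```python
def sample_median(values):
    """Median of a sample: middle value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the (n/2)th and (n/2 + 1)th

birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]
white_blood_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]

print(sample_median(birth_weights))       # 3246.5
print(sample_median(white_blood_counts))  # 8
```

The extreme count of 35 has no influence on the second result, which illustrates the insensitivity of the median to very large values.
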
Characteristics of Median
 Uniqueness.

 Simplicity.

 It can only be used with quantitative data.

 It is the center of the data set in that at least half of the data values are greater than or equal to it, and at least half of the data values are less than or equal to it.

 It can be less sensitive to extreme data values than the mean.

Median in a Grouped data

 In calculating the median from a grouped frequency table, the basic assumption is that, within each class of the frequency distribution, observations are uniformly (evenly) distributed over the class interval.

 It is calculated as:

$$\text{Median} = L + \frac{\frac{N}{2} - F_1}{f_{50}} \times W$$

Where
 L = true lower limit of the interval containing the median (i.e. the median class)
 W = length of the interval
 N = total frequency of the sample
 F_1 = cumulative frequency of all intervals below L
 N/2 = number of observations to be counted off from one end of the distribution to reach the median
 f_50 = frequency of the interval containing the median

Example
Time (hours)   Frequency (A)   Cumulative frequency   Midpoint (B)   A × B
10-14           8               8                     12               96
15-19          28              36                     17              476
20-24          27              63                     22              594
25-29          12              75                     27              324
30-34           4              79                     32              128
35-39           1              80                     37               37
Total          80                                                    1655

What is the median?

Solution
The median class is the first class whose cumulative frequency is at least N/2 = 80/2 = 40. This is the 3rd class (20-24), whose cumulative frequency is 63.
 Lower class boundary (L) = 19.5
 Class width (W) = 5
 Frequency of the median class (f_50) = 27
 Cumulative frequency below the median class (F_1) = 36

Median = 19.5 + ((40 − 36)/27) × 5
= 19.5 + 0.7
= 20.2 hours

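The search for the median class and the interpolation within it can be sketched as follows. The function name `grouped_median` and the tuple layout are assumptions made for this illustration. The sketch also assumes class limits recorded in whole hours, so that the true lower limit is the stated limit minus 0.5.

```python
def grouped_median(classes):
    """classes: list of (lower limit, upper limit, frequency) in ascending order."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cumulative = 0                        # cumulative frequency below the current class
    for lower, upper, f in classes:
        if cumulative + f >= half:        # first class whose cumulative frequency reaches N/2
            true_lower = lower - 0.5      # assumes limits recorded in whole units
            width = upper - lower + 1
            # Linear interpolation inside the median class.
            return true_lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency table")

classes = [(10, 14, 8), (15, 19, 28), (20, 24, 27),
           (25, 29, 12), (30, 34, 4), (35, 39, 1)]

print(round(grouped_median(classes), 1))  # 20.2
```
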
Mode

 The mode is the value of the observation that occurs with the greatest frequency.

 A particular disadvantage is that, with a small number of observations, there may be no mode.

 In addition, there may sometimes be more than one mode, as in a bimodal (two-peak) distribution.

Find the modal values for the following data

A. 22, 66, 69, 70, 73 (No modal value)

B. 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (Modal value = 3.0 kg)

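A small Python sketch can reproduce both answers. The function name `modes` is illustrative. It follows the convention used in example A, that a data set in which every value occurs only once has no mode, and it returns a list because a data set may have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                 # every value occurs once: no modal value
    return [v for v, c in counts.items() if c == highest]

set_a = [22, 66, 69, 70, 73]
set_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]

print(modes(set_a))  # []
print(modes(set_b))  # [3.0]
```
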
Characteristics of Mode
 It can be used with quantitative and qualitative data.

 The mode will not always exist, but if it does, it must be an actual value of the data set.

 It might not be located near the center of the data values.

 It can be less sensitive to extreme data values than both the median and the mean.

Skewness
 If extremely low or extremely high observations are present in a distribution, the mean tends to shift towards those scores.

 Based on the type of skewness, distributions can be:

A. Negatively skewed distribution: occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end.

B. Positively skewed distribution: occurs when the majority of scores are at the left end of the curve and a few extremely large scores are scattered at the right end.

C. Symmetrical distribution: neither positively nor negatively skewed. A curve is symmetrical if one half of the curve is the mirror image of the other half.

 In unimodal (one-peak) symmetrical distributions, the mean, median and mode are identical.

 The mean, median and mode occur in alphabetical order (mean < median < mode) when the longer tail is at the left of the distribution (negative skew).

 They occur in reverse alphabetical order (mode < median < mean) when the longer tail is at the right of the distribution (positive skew).

Measures of Variation

 Some of the commonly used measures of dispersion (variation) are: range, interquartile range, variance, standard deviation and coefficient of variation.

Range
 The range is defined as the difference between the largest and smallest observations in the data.

 It is the crudest measure of dispersion.

 The range is a measure of absolute dispersion and as such cannot be usefully employed for comparing the variability of two distributions expressed in different units.

Example:

 The range of the data in set 1 is 70 − 30 = 40

 The range of the data in set 2 is 53 − 48 = 5

Characteristics of Range
1) Since it is based on the two extreme cases in the entire distribution, the range may change considerably if either extreme case happens to drop out, while the removal of any other case would not affect it at all.

2) It wastes information, for it takes no account of the rest of the data.

3) The extreme values may be unreliable; that is, they are the most likely to be faulty.

4) It is not suitable for the mathematical treatment required in deriving the techniques of statistical inference.

Quantiles
 Another approach to quantifying the spread in a data set, which addresses some of the shortcomings of the range, is the use of quantiles or percentiles.

 Intuitively, the pth percentile is the value V_p such that p percent of the sample points are less than or equal to V_p.

 The median, being the 50th percentile, is a special case of a quantile.

 As was the case for the median, a different definition is needed for the pth percentile, depending on whether np/100 is an integer or not.

Definition:
 The pth percentile is defined by

1. The (k+1)th largest sample point if np/100 is not an integer (where k is the largest integer less than np/100).

2. The average of the (np/100)th and (np/100 + 1)th largest observations if np/100 is an integer.

Example:
Compute the 10th and 90th percentile for the birth weight data.
 2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146

Solution:
 Since 20 × 0.1 = 2 and 20 × 0.9 = 18 are integers, the 10th and 90th percentiles are defined by
 10th percentile = the average of the 2nd and 3rd largest values = (2581 + 2759)/2 = 2670 g
 90th percentile = the average of the 18th and 19th largest values = (3609 + 3649)/2 = 3629 g
 We would estimate that 80 percent of birth weights fall between 2670 g and 3629 g, which gives an overall feel for the spread of the distribution.

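The two-case percentile definition can be sketched in Python as follows. The function name `percentile` is illustrative, and the sketch assumes p is a whole number strictly between 0 and 100 so that the test "is np/100 an integer?" can be done with exact integer arithmetic.

```python
def percentile(values, p):
    """p-th percentile by the two-case definition; p is an integer with 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * p, 100)    # k = whole part of np/100
    if remainder == 0:
        # np/100 is an integer: average the k-th and (k+1)-th ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # np/100 is not an integer: take the (k+1)-th ordered value (0-based index k).
    return ordered[k]

birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

print(percentile(birth_weights, 10))  # 2670.0
print(percentile(birth_weights, 90))  # 3629.0
```

With p = 50 the same function returns 3246.5, the median of the birth weight data, consistent with the median being the 50th percentile.
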
 Other quantiles that are particularly useful are the quartiles of the distribution.

 The quartiles divide the distribution into four equal parts.

 The second quartile is the median.

 The interquartile range (IQR) is the difference between the first and the third quartiles.

 To compute it, first sort the data in ascending order, then find the first quartile (the value below which the first quarter of the observations lie) and the third quartile; the IQR is the distance between them.

E.g.
 Given the following data set (age of patients):
18, 59, 24, 42, 21, 23, 24, 32
 Find the interquartile range.

1) Sort the data from lowest to highest.
2) Find the first and third quartiles (the bottom and top quarters of the data).
3) Find the difference (interquartile range) between the two quartiles.

Sorted data: 18 21 23 24 24 32 42 59

 1st quartile = the {(n+1)/4}th observation = the 2.25th observation = 21 + (23 − 21) × 0.25 = 21.5

 3rd quartile = the {3(n+1)/4}th observation = the 6.75th observation = 32 + (42 − 32) × 0.75 = 39.5

 Hence, IQR = 39.5 − 21.5 = 18

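The quartile positions and the interpolation step can be sketched in Python. The function name `quartile` is illustrative; it places the k-th quartile at position k(n+1)/4 in the ordered data and interpolates linearly between neighbouring observations, as in the worked example. Other quartile conventions exist and can give slightly different values.

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) at position k(n+1)/4, with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / 4           # 1-based position, possibly fractional
    whole = int(position)                # observation just below the position
    fraction = position - whole
    if whole < 1:
        return ordered[0]                # position falls before the first observation
    if whole >= n:
        return ordered[-1]               # position falls at or beyond the last observation
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])

ages = [18, 59, 24, 42, 21, 23, 24, 32]
q1 = quartile(ages, 1)
q3 = quartile(ages, 3)

print(q1, q3, q3 - q1)  # 21.5 39.5 18.0
```
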
Variance
 The variance measures how far a set of numbers is spread out.
 It is a numerical value used to indicate how widely individuals in a group vary.

 If individual observations vary considerably from the group mean, the variance is large, and vice versa.

 A variance of zero indicates that all values are identical.

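Variance for ungrouped data

Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
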
Example:
 Areas of surfaces sprayable with DDT from a sample of 15 houses are as follows (m²):

 101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145

 Find the variance of the above distribution.

 The mean of the sample is 125 m².

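 Sample variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

$= \frac{(101 - 125)^2 + (105 - 125)^2 + \dots + (145 - 125)^2}{15 - 1}$

$= \frac{2502}{14} = 178.71$ (m²)²

Because the deviations are squared, the variance is expressed in the square of the original units.
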
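The sample variance of the sprayable-area data can be verified with a short Python sketch. The variable names are illustrative.

```python
# Sprayable surface areas (m²) for the 15 houses in the example.
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

n = len(areas)
mean_area = sum(areas) / n
# Sum of squared deviations about the mean.
squared_deviations = sum((x - mean_area) ** 2 for x in areas)
sample_variance = squared_deviations / (n - 1)    # divide by n - 1 for a sample

print(mean_area, squared_deviations, round(sample_variance, 2))  # 125.0 2502.0 178.71
```
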
Variance for grouped data
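Sample variance: $S^2 = \frac{\sum f (m - \bar{x})^2}{n - 1}$

Population variance: $\sigma^2 = \frac{\sum f (m - \mu)^2}{N}$

Where,
f = frequency of the class
m = midpoint of the class
n (or N) = Σf, the total number of observations, and x̄ (or μ) is the sample (or population) mean
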
Example for a sample of 80 study subjects
Time (hours)   Frequency (A)   Midpoint (B)   A × B
10-14           8              12               96
15-19          28              17              476
20-24          27              22              594
25-29          12              27              324
30-34           4              32              128
35-39           1              37               37
Total          80                             1655

What is the variance?

Example for a sample of 80 study subjects
Time (hours)   Frequency (A)   Midpoint (B)   A × B (C)   (B − x̄)² (D)   A × D
10-14           8              12               96           75.69        605.52
15-19          28              17              476           13.69        383.32
20-24          27              22              594            1.69         45.63
25-29          12              27              324           39.69        476.28
30-34           4              32              128          127.69        510.76
35-39           1              37               37          265.69        265.69
Total          80                             1655                       2287.20

The mean is 1655/80 ≈ 20.7 hours.

$$S^2 = \frac{8(12 - 20.7)^2 + 28(17 - 20.7)^2 + 27(22 - 20.7)^2 + \dots + 1(37 - 20.7)^2}{80 - 1} = \frac{2287.2}{79} \approx 28.95 \text{ hours}^2$$

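The grouped-variance calculation can be sketched in Python as a check on the table. The tuple layout is an assumption for this illustration. The sketch uses the unrounded mean (20.6875) rather than 20.7, so the sum of squares differs from the table in the second decimal place, but the variance agrees to two decimals.

```python
# Each class: (lower limit, upper limit, frequency), taken from the table.
classes = [(10, 14, 8), (15, 19, 28), (20, 24, 27),
           (25, 29, 12), (30, 34, 4), (35, 39, 1)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

n = sum(freqs)
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
# Weighted sum of squared deviations of the midpoints from the mean.
weighted_ss = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints))
grouped_variance = weighted_ss / (n - 1)          # sample variance: divide by n - 1

print(round(grouped_mean, 1), round(grouped_variance, 2))  # 20.7 28.95
```
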
Standard Deviation and Variance

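 The sample and population standard deviations, denoted by S and σ (by convention) respectively, are the positive square roots of the corresponding variances:

$$S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}, \qquad \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
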
 This measure of variation is universally used to show the scatter of the individual measurements around the mean of all the measurements in a given distribution.

 Note that the sum of the deviations of the individual observations of a sample about the sample mean is always 0.

Example:
 Areas of surfaces sprayable with DDT from a sample of 15 houses are as follows (m²):

 101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145

 Find the standard deviation of the above distribution.

 The mean of the sample is 125 m².

 Sample variance: $S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

$= \frac{(101 - 125)^2 + (105 - 125)^2 + \dots + (145 - 125)^2}{15 - 1}$

$= \frac{2502}{14} = 178.71$ (m²)²

 Hence, the standard deviation $S = \sqrt{178.71} = 13.37$ m².

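The step from variance to standard deviation is a single square root, as this short Python sketch shows. The variable names are illustrative.

```python
import math

# Sprayable surface areas (m²) for the 15 houses in the example.
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145]

n = len(areas)
mean_area = sum(areas) / n
sample_variance = sum((x - mean_area) ** 2 for x in areas) / (n - 1)
sample_sd = math.sqrt(sample_variance)   # back in the original units (m²)

print(round(sample_variance, 2), round(sample_sd, 2))  # 178.71 13.37
```

Unlike the variance, the standard deviation is in the same units as the data, which makes it easier to interpret alongside the mean.
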
