Data
description – distribution
parameters
Dr Nguyen Thi Van Anh
Department of Biotechnology-Pharmacology
University of Science and Technology of Hanoi
Data don’t make any sense.
We will have to resort to statistics
Objectives of the lesson
• Understand how data can be appropriately organized
and displayed
• Be able to calculate and interpret measures of central
tendency (median, mean, mode)
• Be able to calculate and interpret measures of
dispersion (range, variance, standard deviation)
What is descriptive statistics?
Techniques for summarizing and organizing the data
so that we may more easily determine
what information they contain
Descriptive statistics
Content:
• The ordered array
• Grouped data
• Frequency tables
• The central tendency
• Distribution and skewness
• The dispersion
1. The ordered array (sorting data)
List the data in order from
the smallest to the largest values
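As a minimal sketch in Python, an ordered array can be produced with the built-in sorted function. The values are the ones used in the mean example of this lesson.

```python
# Values taken from the mean example in this lesson
data = [15, 20, 21, 20, 36, 15, 25, 15]

# sorted() returns a new list ordered from the smallest to the largest value
ordered = sorted(data)
print(ordered)  # [15, 15, 15, 20, 20, 21, 25, 36]
```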
2. Grouped data
• To group a set of observations we select a set of contiguous,
nonoverlapping intervals such that each value in the set of
observations can be placed in one, and only one, of the
intervals
• 5 ≤ number of class intervals ≤ 15
A formula given by Sturges can be used as a guide
k: number of class intervals
n: number of values
w: class interval width
R: range (difference between the largest and smallest values)
Sturges' formula - example
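The formula itself is not reproduced in the text above. The commonly cited form of Sturges' rule is k = 1 + 3.322 log10(n), with the class width taken as w = R / k. The sketch below assumes that form; the function name and the heights are hypothetical and serve only as an illustration.

```python
import math

def sturges(values):
    """Return (k, w): suggested number of class intervals and their width."""
    n = len(values)
    k = 1 + 3.322 * math.log10(n)   # Sturges' rule (assumed form)
    r = max(values) - min(values)   # R: range of the data
    w = r / k                       # class interval width
    return k, w

# Hypothetical heights in cm, not taken from the lecture
heights = [150, 152, 155, 158, 160, 161, 163, 165, 166, 168,
           170, 171, 172, 174, 175, 177, 178, 180, 182, 185]
k, w = sturges(heights)
print(round(k, 2), round(w, 2))
```

In practice k is rounded to a whole number and w to a convenient width, keeping the number of class intervals between 5 and 15.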
3. Frequency tables
Frequency: number of occurrences
Types of Rice
Rice          Frequency
Old           17
Traditional   15
New           4
A table like this is called a frequency distribution
Proportions and percentages
• Proportion (relative frequency): the frequency divided by the total
number of values
• Percentage: the proportion multiplied by 100
• Percentages are easier to interpret than raw frequencies, so
frequency tables are often augmented with an extra column of
percentages
Rice          Frequency   Relative frequency   Percentage
Old           17          0.4722               47.22%
Traditional   15          0.4167               41.67%
New           4           0.1111               11.11%
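The relative frequencies and percentages in the rice table can be reproduced with a short Python sketch. The dictionary layout is just one convenient way to hold the counts.

```python
# Frequencies from the rice table
freq = {"Old": 17, "Traditional": 15, "New": 4}

total = sum(freq.values())  # total number of values (36)

# Proportion = frequency / total; percentage = proportion * 100
proportions = {name: f / total for name, f in freq.items()}

for name, f in freq.items():
    p = proportions[name]
    print(f"{name:<12}{f:>4}{p:>9.4f}{p * 100:>8.2f}%")
```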
Sturges' formula and frequency table - example
Cumulative (relative) frequency
We may sum, or cumulate, frequencies and relative frequencies
to obtain information about 2 or more contiguous groups.
This gives cumulative frequencies and
cumulative relative frequencies
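A minimal Python sketch of cumulating frequencies, reusing the rice counts purely as numbers. Cumulative frequencies are most meaningful when the groups have a natural order, such as class intervals.

```python
from itertools import accumulate

freq = [17, 15, 4]  # counts for three contiguous groups
total = sum(freq)

# Running totals of the frequencies
cum_freq = list(accumulate(freq))

# Cumulative relative frequency = cumulative frequency / total
cum_rel = [c / total for c in cum_freq]

print(cum_freq)                        # [17, 32, 36]
print([round(r, 4) for r in cum_rel])  # [0.4722, 0.8889, 1.0]
```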
Bar chart (Bar graph)
• The main graphical display of categorical data is a bar chart
• The height of each bar is equal to the frequency (or relative
frequency) of that category
Histogram (a special type of bar chart)
Displays a frequency distribution or
relative frequency distribution graphically.
QUIZ
Construct: a) A frequency table
b) The bar chart
QUIZ
The heights of students (in cm) in a class are as follows:
a) Build a frequency table, grouping data into classes
b) Plot the data on a graph
Sturges' formula
b) A bar chart / Histogram
4. Measures of central tendency
• Mean
• Median
• Mode
Mean (average)?
Obtained by adding up all the values of a variable and dividing by
the number of values
Ex: 15, 20, 21, 20, 36, 15, 25, 15
Mean = Sum/8 = 167/8 = 20.875
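The same calculation as a short Python sketch:

```python
data = [15, 20, 21, 20, 36, 15, 25, 15]

# Mean = sum of the values / number of values
mean = sum(data) / len(data)
print(sum(data), len(data), mean)  # 167 8 20.875
```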
Properties of Mean
- Uniqueness (Only one mean for a data set)
- Simplicity (easily understood and easy to compute)
- Affected by each value of the data set (Extreme values can
distort the mean)
Ex: 15, 20, 21, 19, 80
Mean = Sum/5 = 31 → not representative of the data set
Median
- The value that divides the data set into 2 equal parts
- The number of values ≤ median equals the number of values ≥ median
How to compute the median?
Values are arranged in order of magnitude (n = number of values)
- If the number of values is odd, the median is the middle value, the ((n+1)/2)th
- If the number of values is even, the median is the mean of the 2 middle values,
the (n/2)th and (n/2+1)th
Example:
1 2 4 4 5 6 6
Median is 4 (the 4th value)
1 2 4 4 5 6 6 7
2 middle values: 4 and 5 (4th and 5th values), with 50% of the values below and 50% above
Median = (4+5)/2 = 4.5
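The odd/even rule can be sketched in Python as follows. The function name is hypothetical; the standard library's statistics.median gives the same results.

```python
def median_value(values):
    """Median by the rule above: middle value, or mean of the 2 middle values."""
    s = sorted(values)          # arrange in order of magnitude
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]           # odd n: the ((n+1)/2)th value
    return (s[mid - 1] + s[mid]) / 2  # even n: mean of (n/2)th and (n/2+1)th

print(median_value([1, 2, 4, 4, 5, 6, 6]))     # 4
print(median_value([1, 2, 4, 4, 5, 6, 6, 7]))  # 4.5
```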
Properties of Median
- Uniqueness (Only one median for a data set)
- Simplicity (easily understood and easy to compute)
- Not as drastically affected by extreme values as the mean is
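To illustrate the last property, the sketch below compares the mean and the median on the data set with the extreme value 80 used earlier in this lesson.

```python
from statistics import mean, median

data = [15, 20, 21, 19, 80]   # 80 is an extreme value

print(mean(data))    # 31, pulled upward by the extreme value
print(median(data))  # 20, close to most of the values

# Without the extreme value the two measures nearly agree
trimmed = [15, 20, 21, 19]
print(mean(trimmed), median(trimmed))  # 18.75 19.5
```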
Mode
- The value that occurs most frequently in the data set
- If all values are different, there is no mode
- A set of data may have more than 1 mode
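A minimal Python sketch covering the three cases above (one mode, no mode, several modes). The function name is hypothetical, and treating "all values different" as "no mode" follows the convention stated in this lesson.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; empty if every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())       # highest frequency in the data set
    if top == 1:
        return []                    # all values are different: no mode
    return [v for v, c in counts.items() if c == top]

print(modes([15, 20, 21, 20, 36, 15, 25, 15]))  # [15]
print(modes([1, 2, 3]))                         # []
print(modes([1, 1, 2, 2, 3]))                   # [1, 2]
```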
5. Distribution and skewness
The distribution gives information about
• a typical value (a center) around which the data are spread
• the variability of values (the spread of the distribution)
• the shape of the distribution (whether a distribution is symmetric or
skewed,…)
Skewness
Symmetric distribution: right half is a mirror image of left half
Asymmetric distribution = skewed distribution
Negative skew: mean < mode
Positive skew: mean > mode
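As a rough illustration of the mean-versus-mode comparison, the sketch below checks it on two small hypothetical data sets (not from the lecture). The comparison is a rule of thumb and can fail for unusual distributions.

```python
from statistics import mean, mode

# Hypothetical data with a long tail to the right
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9]
# Hypothetical data with a long tail to the left
left_tailed = [1, 5, 6, 7, 7, 8, 8, 8, 9]

for name, data in [("right tail", right_tailed), ("left tail", left_tailed)]:
    m, mo = mean(data), mode(data)
    skew = "positive" if m > mo else "negative" if m < mo else "none"
    print(f"{name}: mean={m:.2f}, mode={mo}, skew={skew}")
```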
6. Measures of dispersion
• Dispersion (variation/spread): the variability of the data
• No variability (all values are the same) means no dispersion
• When values are close together, the dispersion is small
Range
Difference between the largest and smallest value
Example: Calculate the mean, median and range
Both A and B: Mean = Median = 7
Range of set A: 13 – 1 = 12
Range of set B: 9 – 5 = 4
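The data for sets A and B are not reproduced in the text. The sketch below uses two hypothetical sets chosen to match the stated results (mean = median = 7, ranges 12 and 4); they are not necessarily the sets from the lecture.

```python
from statistics import mean, median

# Hypothetical sets consistent with the stated mean, median and ranges
set_a = [1, 4, 7, 10, 13]
set_b = [5, 6, 7, 8, 9]

def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

for name, data in [("A", set_a), ("B", set_b)]:
    print(name, mean(data), median(data), value_range(data))
```

Both sets share the same center, but set A is far more spread out, which only the range reveals.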
Variance
A measure of how far the values in the data set lie from the mean
Population variance
Sample variance
Degrees of freedom = n - 1
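The formulas are not reproduced in the text above. The standard definitions are given below, with symbols chosen here as an assumption: N and μ for the population size and mean, n and x̄ for the sample size and mean.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
```

The sample variance divides by the degrees of freedom, n - 1, rather than by n.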
Variance - Example
Compute mean, variance of a data set:
Standard deviation
The square root of variance
A measure of variation in one data set
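A minimal Python sketch of the variance and standard deviation, computed step by step and checked against the standard library. The data are the values from the mean example in this lesson, used here only as an illustration.

```python
import math
from statistics import pvariance, variance, stdev

data = [15, 20, 21, 20, 36, 15, 25, 15]
n = len(data)
m = sum(data) / n                       # mean = 20.875

# Sum of squared deviations from the mean
ss = sum((x - m) ** 2 for x in data)

pop_var = ss / n                        # population variance (divide by n)
samp_var = ss / (n - 1)                 # sample variance (divide by n - 1)
sd = math.sqrt(samp_var)                # sample standard deviation

print(pop_var, samp_var, round(sd, 3))

# The standard library gives the same results
print(pvariance(data), variance(data), round(stdev(data), 3))
```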
Coefficient of Variation
Expresses the standard deviation as a percentage of the mean
• A measure of relative variation rather than absolute variation
• Can be used to compare variability of 2 or more data sets
measured in different units
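A short Python sketch of the coefficient of variation, CV = (standard deviation / mean) × 100. The two data sets are hypothetical and use different units, which is the situation where the CV is most useful.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

# Hypothetical measurements in different units
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

print(round(coefficient_of_variation(heights_cm), 2))
print(round(coefficient_of_variation(weights_kg), 2))
```

The standard deviations (in cm and kg) cannot be compared directly, but the two percentages can.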
Coefficient of Variation
Variation is much higher in sample 2 than in sample 1
Percentiles and quartiles
The p-th percentile P: p% of the observations lie below P and (100-p)% lie above it
• Quartiles
Lower quartile (Q1): 25th percentile
Median: 50th percentile
Upper quartile (Q3): 75th percentile
• The interquartile range (IQR): difference between the third and first
quartiles (Q3 – Q1)
• Small IQR: small variability
Example:
Calculate quartiles: 5, 7, 4, 4, 6, 2, 8
Example:
Calculate quartiles: 8, 7, 1, 3, 6, 3, 4, 5, 6, 8
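Several conventions exist for computing quartiles, and the text does not say which one is intended. The sketch below assumes the "median of each half" convention (the overall median is left out of both halves when n is odd); other methods can give slightly different values. The function name is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]                             # values below the median
    upper = s[half + 1:] if n % 2 else s[half:]  # values above the median
    return median(lower), median(s), median(upper)

print(quartiles([5, 7, 4, 4, 6, 2, 8]))           # (4, 5, 7)
print(quartiles([8, 7, 1, 3, 6, 3, 4, 5, 6, 8]))  # (3, 5.5, 7)
```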
Boxplot
• Represents 5 values graphically (minimum, lower quartile, median,
upper quartile, maximum)
• Splits the data set into 4 quarters with
equal numbers of values
• What does the boxplot tell about the
distribution?
- Center (median): the vertical line inside the box indicates the
center of the distribution
- Spread: the width of the box (interquartile range, IQR)
- Shape: median in the middle of the box → symmetric
distribution
• Construction of a boxplot? (5 steps)
Outlier:
- A value that is more than 1.5 times the IQR away from the box,
that is, outside the range [Q1 – 1.5(IQR), Q3 + 1.5(IQR)]
- Detection of outliers is important (they may be incorrectly recorded)
- If anything atypical is found, outliers should be deleted from the data
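The 1.5 × IQR rule can be sketched in Python as follows. The data are hypothetical, and the quartiles are computed with the median-of-halves convention, which is an assumption; other quartile methods shift the fences slightly.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    upper = s[half + 1:] if n % 2 else s[half:]
    return median(s[:half]), median(s), median(upper)

# Hypothetical data containing one extreme value
data = [5, 7, 4, 4, 6, 2, 8, 30]

q1, med, q3 = quartiles(data)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Values outside [Q1 - 1.5(IQR), Q3 + 1.5(IQR)] are flagged as outliers
outliers = [x for x in data if x < low_fence or x > high_fence]
inside = [x for x in data if low_fence <= x <= high_fence]

print(q1, med, q3, iqr)        # 4.0 5.5 7.5 3.5
print(low_fence, high_fence)   # -1.25 12.75
print(outliers)                # [30]
print(min(inside), max(inside))  # whisker ends: 2 8
```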
• Example:
Kriesel et al. examined glomerular filtration
rate (GFR) in 19 pediatric patients (some
measured more than once). Compute:
a) Mean, median, variance, standard
deviation, coefficient of variation
b) Construct a boxplot
QUIZ
• What are the objectives of descriptive statistics?
QUIZ
Michelson in 1882 determined the speed of light:
Calculate: a) mean
b) Variance
c) Standard deviation
d) Coefficient of variation
QUIZ
Answer:
QUIZ
Suppose that the range of a sample is 105.4
and the calculated SD is 260.6.
What do you conclude?
Answer:
There was an error in the calculation of the SD.
The SD measures the typical distance of the sample values from the mean.
No value can lie farther from the mean than the range, so the SD cannot exceed the range.