Introduction to Biostatistics
Wondimu Ayele (MSc, PhD fellow)
SPH, AAU
January 2019
Objective
– Define statistics and its importance in different
disciplines
– Define variable and data
– Describe types of data and measurement scales
– Organize and display data
– Define and calculate measures of central tendency
and measures of spread
Biostatistics -Notes WA , SPH AAU ,2016
• References
1. Pagano M. & Gauvreau K.: Principles of Biostatistics
2. Colton T.: Statistics in Medicine
3. Bland M.: An Introduction to Medical Statistics
4. Daniel W.: Biostatistics: A Foundation for Analysis in the Health Sciences
5. Moore D. S. & McCabe G. P.: Introduction to the Practice of Statistics
6. Kleinbaum & K. Muller: Applied Regression Analysis and Other Multivariable Methods
7. Fisher L. D. & Van Belle G.: Biostatistics
8. Kirkwood B.: Essentials of Medical Statistics
9. Feinstein A. R.: Principles of Medical Statistics
10. Knapp R. G. & Miller M. C.: Clinical Epidemiology and Biostatistics
11. Sheskin D. J.: Handbook of Parametric and Nonparametric Statistical Procedures
12. Armitage P. & Berry G.: Statistical Methods in Medical Research
13. Rao P. S. R. S.: Sampling Methodologies with Applications
14. Forthofer R. N. & Lee E. S.: Introduction to Biostatistics
Introduction
• What is Statistics?
• Methods for collecting, organizing, presenting, and
analyzing data, and for drawing inferences about a
body of data when only a part of the data is
observed.
WHY WE NEED STATISTICS
• To present the data in a concise and definite form.
Statistics helps in classifying and tabulating raw data for
processing and further tabulation for end users.
• To make complex and large data easy to understand.
• This is done by presenting the data in the form of tables,
graphs, diagrams, etc., or by condensing the data with the
help of means, measures of dispersion, etc.
• For comparison: tables, means and measures of
dispersion can help in comparing different sets of data.
WHY WE NEED STATISTICS
• In measuring the magnitude of a phenomenon.
• Statistics has made it possible to measure the population of a
country, its industrial growth, agricultural growth,
educational level, and health status.
• Everything in medicine, be it research, diagnosis or
treatment depends on counting/measurement.
– High/ Low B.P??
– Pulse rate.
– Incidence of disease.
– Death rate.
– Enlargement of liver/ spleen
Statistics and Health
• Biostatistics
• Health statistics
• Medical Statistics
• Vital Statistics
• These terms are not mutually exclusive
What is biostatistics?
• The application of statistical methods to biological
phenomena.
Why need biostatistics?
1. Main reason: handling variations
– Biological variation
• Attributes differ not only among individuals but also
within the same individual over time
• Examples: height, weight, blood pressure, eye color.
– Sample variation
• Biomedical research projects are usually carried out
on small numbers of study subjects
Why need to learn biostatistics?
2. Essential for scientific method of investigation
– Formulate hypothesis
– Design study to objectively test hypothesis
– Collect reliable and unbiased data
– Process and evaluate data rigorously
– Interpret and draw appropriate conclusions
3. Essential for understanding, appraisal and critique of
scientific literature
Examples of uses of biostatistics
• To define what is normal/ healthy in a population (Setting
limits of normality).
• To compare drug action –potency/efficacy
• Confirm association between two attributes: Cancer and
smoking or Socioeconomic status and malnutrition
• Usefulness of vaccines
Uses in Public Health Planning
• Recording of vital events
• Incidence/prevalence of disease.
• Leading causes of death/ morbidity in the community
• Demographic characteristics of a community.
• Health system research.
Application of Biostatistics
1. Statistical genetics
2. Numerical Taxonomy
3. Statistical Ecology
4. Statistical Ethnology
5. Forest mensuration
6. Forest and Agricultural yield table
7. Biomass estimation
8. Statistical environment management
9. Demography
10. Medical sciences
11. Biological variation and uncertainties
Limitation of statistics
• Statistics does not deal with individual measurements.
Since statistics deals with aggregates of facts, it cannot
be used to study the changes that have taken place in
individual cases.
• Statistics cannot be used directly to study qualitative
phenomena such as morality, intelligence or beauty, as
these cannot be measured directly. However, it may be possible
to analyze such problems statistically by expressing them
numerically.
• Statistical results are true only on average. The
conclusions obtained statistically are not universal truths;
they are true only under certain conditions. This is
because statistics as a science is less exact than
the natural sciences.
Limitation of statistics
• Statistical data should be treated as approximations or
estimates, not as precise measurements.
• Statistical results might lead to fallacious conclusions.
• Only someone with a sound knowledge of
statistical methods can handle statistical data
efficiently.
Types of Statistics
Statistics
• Descriptive statistics
– Tabular representation
– Diagrammatic representation
– Measures of central tendency
– Measures of variability
• Probability
• Sampling theory
• Inferential statistics
– Hypothesis testing: parametric tests, non-parametric tests
– Estimation theory: point estimation, interval estimation
Population & Sample
• Target population: A collection of items that have
something in common for which we wish to draw
conclusions at a particular time.
• Study Population: The specific population from which data
are collected
• Sample: A subset of a study population, about which
information is actually obtained.
• Generalizability is a two-stage procedure: we want to be able
to generalize from the sample to the study population and
then from the study population to the target population
Population and sample
• E.g. In a study of the prevalence of HIV among students at
Addis Ababa University, a random sample of
pharmacy students in the College of Health Sciences of AAU
was included.
• Target population: all students at Addis Ababa University
• Study population: all students in the College of Health Sciences
of AAU
• Sample: the pharmacy students in the College of Health Sciences
of AAU who were included in the study
[Figure: nested sets, with the sample inside the study population inside the target population]
Parameter and Statistic
Parameter: A descriptive measure computed
from the data of a population.
Statistic: A descriptive measure computed from
the data of a sample.
Scales of measurement
• Clearly not all measurements are the same.
• Measuring an individual's weight is qualitatively different
from measuring their response to some treatment on a
three-category scale: “improved”, “stable”, “not improved”.
• Measuring scales are different according to the degree of
precision involved.
• There are four types of scales of measurement.
Scales of measurement
1. Nominal scale: uses names, labels, or symbols to assign
each measurement to one of a limited number of categories
that cannot be ordered.
Examples: Blood type, sex, race, marital status, Adolescence
stage, Color of cars.
2. Ordinal scale: assigns each measurement to one of a
limited number of categories that are ranked in terms of a
graded order.
• Examples: Patient status, Cancer stages, Socioeconomic
status, IQ of children.
Scales of measurement
3. Interval scale: assigns each measurement to one of an
unlimited number of categories that are equally spaced.
It has no true zero point.
Example: Temperature measured on Celsius or Fahrenheit
4. Ratio scale: measurement begins at a true zero point
and the scale has equal spacing.
Examples: Height, weight, blood pressure
• Data: A collection of information about
individuals or groups.
Variable: A characteristic which takes different
values in different persons, places, or things.
Example:
Animals of the same species may differ in their Length,
weight, age, sex, Diastolic BP, heart rate, etc
Types of variable
Qualitative/categorical variable: records which
group or category an individual/observation belongs to;
it classifies.
• It does not make sense to perform arithmetic on this type of
variable.
Examples: gender, ethnic group, type of diagnosis (present or
absent), etc.
Quantitative variable: a variable that has magnitude.
A true numerical value; it indicates an amount and is often
obtained from a measuring instrument;
it makes sense to perform arithmetic on this type of
variable. E.g. weight, length, age, etc.
Types of Variable
Discrete variable: It can only have a finite number
of values in any given interval.
– Indivisible units
– Restricted to whole numbers
– Can be counted
• Example.
– # of children in a family
– # of houses in a neighborhood
– # of patients discharged from the hospital on a given day
SUMMARY
Types of variables (measurement scales)
• Qualitative or categorical
– Nominal (not ordered), e.g. ethnic group
– Ordinal (ordered), e.g. response to treatment
• Quantitative (measurement)
– Discrete (count data), e.g. # of admissions
– Continuous (real-valued), e.g. height
Types of variable
Continuous variable: It can have an infinite number
of possible values in any given interval.
• Unlimited number of possible values
• Infinite number of values can fall b/n any 2
observed values
• No gaps between units
Examples: time taken to solve a problem,
height, weight or temperature of patients
Sources of data
Routinely kept records
– Hospital medical records, accounting records
Survey
– Mode of transportation used by patients to visit the
clinic.
Experiments
– Best strategies for maximizing patient compliance.
External sources.
– Already published data
Types of data
1. Primary source data: primary data are those data which are collected by
the investigator himself (herself) for the purpose of a specific goal or study.
Example: data gathered from interview, questionnaire, or field observation of the
investigator or researcher.
2. Secondary source data: when an investigator uses data which have already
been collected by others. Secondary sources can be individuals or agencies,
which supply data originally collected for other purposes by them or others.
• They are less expensive in time and cost than primary data.
• Usually they are published or unpublished materials, records, reports,
etc.
Descriptive Statistics
• Techniques used to organize and summarize a
set of data in a concise way.
–Organization of data
–Summarization of data
–Presentation of data
• Numbers that have not been summarized
and organized are called raw data.
Descriptive cont...
• Statistics is used to organize and interpret
research observations and findings
• Before interpretation & communication of the
findings, the raw data must be organized and
presented in a clear and understandable way
Descriptive cont….
Ordered array: A simple arrangement of individual
observations in order of magnitude.
Frequency distribution: A table which involves a listing
of all observed values of the variable being studied and
how many times each value is observed.
a) Qualitative variable: Count the number of cases in each
category.
b) Quantitative variable: Select a set of contiguous,
non-overlapping intervals such that each value in the set of
observations can be placed in one, and only one, of the
intervals.
Descriptive cont…
Frequency distribution:
• The actual summarization and organization of
data starts from frequency distribution.
• The distribution condenses the raw data into a
more useful form and allows for a quick visual
interpretation of the data.
Frequency distributions for
categorical variables
• Summarizing categorical variables (nominal &
ordinal) is simple
• Count the number of observations (frequency)
in each category and present as relative
frequencies (percentages)
• Often presented in the form of tables, bar charts and
pie charts
Frequency , categorical cont...
• A relative frequency distribution: shows the
proportion of counts that fall into each class or
category
• A relative frequency value for any category is
obtained by dividing the number of
observations in that category by the total
number of observations.
• This can be reported as a percentage by
multiplying the resulting fraction by 100.
Cumulative frequency distribution
Cumulative frequency: The running total obtained when the
frequencies of two or more successive classes are added.
Cumulative relative frequency: The percentage
of the total number of observations that have a
value either in that interval or below it.
Mid-point: The value of the interval which lies
midway between the lower and the upper limits of
a class.
Cumulative frequency cont…
True limits (class boundaries): Those limits
that make an interval of a continuous variable
continuous in both directions.
Used to close the gaps between adjacent class intervals.
For limits recorded as whole numbers, subtract 0.5 from
the lower limit and add 0.5 to the upper limit.
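As a minimal sketch of the rule above, the following Python function turns class limits into true limits and also returns the class mid-point. The function name, the `unit` parameter and the age classes are illustrative assumptions, not part of the original notes.

```python
def class_boundaries(lower, upper, unit=1):
    """Return (lower boundary, upper boundary, mid-point) for one class.

    Assumes the class limits are recorded to the nearest `unit`, so half a
    unit is subtracted from the lower limit and added to the upper limit.
    """
    half = unit / 2
    return lower - half, upper + half, (lower + upper) / 2


# Hypothetical age classes with whole-number limits
classes = [(30, 39), (40, 49), (50, 59)]
for lower, upper in classes:
    low_b, up_b, mid = class_boundaries(lower, upper)
    print(f"{lower}-{upper}: true limits {low_b} to {up_b}, mid-point {mid}")
```

With true limits, the upper boundary of one class coincides with the lower boundary of the next, so no value can fall between classes.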
Frequency distributions
• Data contain information, and summarization is a way
of making it easier to determine the nature of that
information.
• Relative frequency distributions are most often used in
scientific publications to describe quantitative data sets.
They are better suited to the description of large data sets
and they permit greater flexibility in the choice of class
widths.
– A frequency distribution is a table that organizes data
into classes.
– The classes are non-overlapping, i.e. classes without common
items.
Guidelines for constructing tables
• Keep them simple
• Limit the number of variables to three or fewer
• All tables should be self-explanatory
• Include clear title telling what, when and where
• Clearly label the rows and columns
• State clearly the unit of measurement used
• Explain codes and abbreviations in the foot-note
• Show totals
• If the data are not original, indicate the source in a foot-note
• Example 1 The classification of students of a group by
the score on the subject “Statistical analysis” is presented
in Table 2.0a. The table of frequencies for the data set
generated by computer using the software SPSS is shown
in Figure 2.1.
Category    Frequency   Percent   Valid percent   Cumulative percent
Bad             6         13.3        13.3              13.3
Excellent      18         40.0        40.0              53.3
Good           15         33.3        33.3              86.7
Medium          6         13.3        13.3             100.0
Total          45        100.0       100.0
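The percent and cumulative percent columns can be recomputed from the frequencies alone. The sketch below is one possible way to do this in Python; it takes only the four frequencies (6, 18, 15, 6) from the table, and the variable names are illustrative.

```python
from itertools import accumulate

# Frequencies taken from the score table (category -> count)
frequencies = {"Bad": 6, "Excellent": 18, "Good": 15, "Medium": 6}

total = sum(frequencies.values())
running_totals = list(accumulate(frequencies.values()))

rows = []
for (category, count), running in zip(frequencies.items(), running_totals):
    percent = round(100 * count / total, 1)        # relative frequency in %
    cum_percent = round(100 * running / total, 1)  # cumulative relative frequency in %
    rows.append((category, count, percent, cum_percent))
    print(f"{category:<10}{count:>4}{percent:>8}{cum_percent:>8}")
```

Each percent is the category count divided by the total of 45, and each cumulative percent is the running total divided by 45.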
Steps to follow to construct a grouped frequency
distribution.
1. Make sure that you have quantitative data
2. Find the range of the data
3. Determine the number of classes that you wish to have, or
use Sturges' rule
4. Determine the width of the class
5. Determine the lower class limit of the first class and
all the subsequent lower class limits
6. Write all the upper class limits of the classes
7. Finally, for each class, count the number of observations
and construct the frequency distribution accordingly
• Example 3.6 Construct a frequency table for the data set of
the above example on the age of 189 subjects.
k = 1 + 3.322 log10(n) ≈ 9 (use 6 for simplicity)
w = R/k ≈ 5.788 (use 10 for simplicity)
• where
• k = number of class intervals, n = number of observations
• w = width of the class interval
• R = range, where R = L − S,
L = the largest value and S = the smallest value in
the set of observations.
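The two formulas above can be sketched in Python as follows. Rounding k up to the next whole number is an assumption made here, and the largest and smallest values used for the width are hypothetical, since the raw ages are not reproduced in these notes.

```python
import math


def sturges_classes(n):
    """Number of class intervals suggested by Sturges' rule, rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))


def class_width(largest, smallest, k):
    """Class width w = R / k, where R = largest - smallest."""
    return (largest - smallest) / k


k = sturges_classes(189)  # n = 189 subjects, as in the example
print(k)

# Hypothetical extremes, used only to illustrate the width formula
print(class_width(largest=80, smallest=30, k=10))
```

In practice both k and w are usually adjusted to convenient round values, as the example does.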
Remarks:
• All classes of frequency table must be mutually exclusive.
• Classes may be open-ended when either the lower or the
upper end of a quantitative classification scheme is
limitless.
For example, classes of age:
– birth to 7, 8 to 15, ..., 64 to 71, 72 and older
• Classification schemes can be either discrete or continuous.
Diagrammatic Representation
Pictorial or graphic
presentation of numerical data
Graphical description of quantitative data:
Histogram and Polygon:
There is an old saying that “one picture is worth a
thousand words”.
Indeed, statisticians have employed graphical techniques
to describe sets of data more vividly.
Bar charts and pie charts were presented before to
describe qualitative data.
For quantitative data summarized into frequency or
relative frequency tables, however, histograms and
polygons are used to describe the data.
Importance of diagrammatic
representation
More attractive than mere figures
Required information can be obtained in
less time and without mental strain
Facilitates comparison
Pattern of change in data can be detected
easily
Stays in memory for more time
Used to understand patterns and trends
Limitations of diagrams
Cannot be used as a substitute for data
Not an alternative to tabulation
Accuracy is not ensured; they give only an approximate
idea
When graphs are poorly designed, they not
only fail to convey your message effectively,
they often mislead and confuse.
Diagrammatic……
Specific types of graphs include:
• For nominal and ordinal data:
– Bar graph
– Pie chart
• For quantitative data:
– Histogram
– Stem-and-leaf plot
– Box plot
– Scatter plot
– Line graph
• Others
Graphical description of qualitative data
• Bar graphs and pie charts are two of the most widely used
graphical methods for describing qualitative data sets.
• Bar graphs give the frequency (or relative frequency) of
each category
• Example 1.3a (Bar Graph)
[Figure 1.3: Bar graph showing the number of students in each category (Bad, Excellent, Good, Medium)]
Two-way table (Cross tabulation):
• This table shows two characteristics and is formed when either of the two
variables (the caption or the stub) is divided into two or more parts.
• For instance, marital status and cervical cancer status can be presented
in the following two-way table.
Marital status   Cervical cancer status
                 Positive   Negative
Single              49         47
Married            216        108
Widowed             87         86
Div/sep             15         45
Graphical (Diagrammatic) Presentation of Data.
• I. Bar Graph
• The bar graph is very commonly used and is well suited to the representation of
qualitative data. The bars are vertical, their lengths are
proportional to the corresponding numerical values, and they should be
equally spaced.
• Example: the following data give the number of clinical nurses in a given
woreda; they can be presented using different diagrams.
           Degree   Diploma   Certificate
Private      45        66         21
Gov't        48        46         12
NGO          12        24          4
[Graph 2.1: Bar graph of the number of clinical nurses in a given woreda, by employer (Private, Gov't, NGO) and qualification (Degree, Diploma, Certificate)]
Multiple bar graph
Sub-divided bar graph
III. Pie diagram (Pie chart)
• A pie chart shows the partitioning of a total into its component parts.
• The diagram is a circle, and the components are slices of the circle.
• The size of a slice represents the proportion of the component out of the total.
• The angle of a component X is calculated as:
\[ \text{Degree of } X = \frac{\text{value of component } X}{\text{total value of the components}} \times 360^\circ \]
Example: The following data give the marital status of 40 women who came for
contraceptive services to St. Paul HMMC. Present the data using a pie diagram.
Marital status   Married   Widowed   Separated   Single
Frequency           8        12         16          4
• The angle of the slice for married women is calculated as:
\[ \text{Degree of married women} = \frac{\text{number of married women}}{\text{total women}} \times 360^\circ = \frac{8}{40} \times 360^\circ = 72^\circ \]
Likewise, the slice angles for widowed, separated and
single women are 108°, 144° and 36°, respectively.
[Graph 2.3: Pie diagram of the marital status of 40 women who came for contraceptive service to St. Paul HMMC: Married 20%, Widowed 30%, Separated 40%, Single 10%]
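The slice angles for this example can be checked with a short Python sketch. The function name is illustrative; the frequencies are those of the 40 women in the example.

```python
def slice_angles(frequencies):
    """Angle in degrees of each pie slice: component / total * 360."""
    total = sum(frequencies.values())
    return {category: count * 360 / total for category, count in frequencies.items()}


marital_status = {"Married": 8, "Widowed": 12, "Separated": 16, "Single": 4}
angles = slice_angles(marital_status)
for category, angle in angles.items():
    print(f"{category}: {angle} degrees")
```

The angles always add up to 360 degrees, which is a quick check on the arithmetic.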
Pie charts
Divide a complete circle (a pie) into slices, each
corresponding to a category, with the central angle and
hence the area of the slice proportional to the category
relative frequency.
Example 1.4b (Pie Chart)
Figure 1.3 Pie chart showing the number of students of each
category
Graphical description of quantitative data:
• Stem and Leaf displays
• Widely used in exploratory data analysis when the data set
is small.
• To explain what a stem is and what a leaf is, we
consider the data in Table 1.4.1 of Daniel, Biostatistics: A
Foundation for Analysis in the Health Sciences.
• Steps to follow in constructing a Stem and Leaf Display
– Divide each observation in the data set into two parts,
the Stem and the Leaf.
– List the stems in order in a column, starting with the
smallest stem and ending with the largest.
– Proceed through the data set, placing the leaf for each
observation in the appropriate stem row.
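The three steps above can be sketched in Python as follows. This is a minimal illustration that assumes two-digit whole-number observations, with the tens digit as the stem and the units digit as the leaf; the ages are hypothetical and are not the data of Table 1.4.1.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} using tens as stems and units as leaves."""
    rows = defaultdict(list)
    for value in sorted(values):        # sorting keeps the leaves in order
        stem, leaf = divmod(value, 10)  # step 1: split into stem and leaf
        rows[stem].append(leaf)         # step 3: place the leaf in its stem row
    return dict(rows)


ages = [23, 35, 31, 48, 27, 35, 42, 29, 50, 38]  # hypothetical ages
display = stem_and_leaf(ages)

# Step 2: list the stems in order, from the smallest to the largest
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Because every leaf is kept, the original observations can be read back from the display.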
Example 1.5
Table 1.4.1 contains a list of the ages of subjects who
participated in the study on smoking cessation discussed
in Example 1.4.1. As can be seen, this unordered table
requires considerable searching for us to ascertain such
elementary information as the age of the youngest and
oldest subjects.
The stem and leaf display of Figure 2.3.8 partitions the data
set into 11 classes corresponding to 11 stems. Here,
two-line stems are used. The number of leaves in each
class gives the class frequency.
Advantages of a stem and leaf display over a frequency
distribution (considered in the next section):
1. the original data are preserved.
2. a stem and leaf display arranges the data in an orderly
fashion and makes it easy to determine certain numerical
characteristics to be discussed in the coming topics.
3. the classes and numbers falling in them are quickly
determined once we have selected the digits that we want
to use for the stems and leaves.
Histogram
• When plotting histograms, the phenomenon of interest is
plotted along the horizontal axis, while the vertical axis
represents the number, proportion or percentage of
observations per class interval, depending on whether
the histogram is a frequency
histogram, a relative frequency histogram or a percentage
histogram, respectively.
• Histograms are essentially vertical bar charts in which the
rectangular bars are centred on the midpoints of the classes.
• Example 3.7 Below we present the frequency histogram
for the data set considered above, for which the
frequency table is constructed in Table 2.3.2.
• Remark: When comparing two or more sets of data, the
various histograms cannot be constructed on the same
graph, because superimposing the vertical bars of one on
another would make interpretation difficult.
• For such cases it is necessary to construct relative
frequency or percentage polygons.
Polygons
• As with histograms, when plotting polygons the
phenomenon of interest is plotted along the horizontal
axis, while the vertical axis represents the number,
proportion or percentage of observations per class interval,
depending on whether the polygon is
a frequency polygon, a relative frequency
polygon or a percentage polygon, respectively.
• For example, the frequency polygon is a line graph connecting the
midpoints of each class interval in a data set, plotted at a
height corresponding to the frequency of the class.
• Example 3.8 Figure 2.3.4 is a frequency
polygon constructed from data in Table 2.3.2.
Cumulative distributions and cumulative polygons
• Other useful methods of presentation which facilitate
data analysis and interpretation are the construction of
cumulative distribution tables and the plotting of
cumulative polygons.
• A cumulative frequency distribution enables us to see
how many observations lie above or below certain
values, rather than merely recording the number of items
within intervals.
Ogive curve
• We may, for example, be interested in knowing
the number of patients whose weight is less than
50 Kg or more than say 60 Kg.
• To get this information it is necessary to change the form
of the frequency distribution from a ‘simple’ to a
‘cumulative’ distribution.
• An ogive curve presents a cumulative frequency distribution
as a graph.
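As a sketch of how a simple distribution is turned into a cumulative one, the Python code below builds "less than" and "or more" cumulative frequencies. The weight classes and frequencies are hypothetical and serve only as an illustration.

```python
from itertools import accumulate

# Hypothetical weight classes (kg) and their frequencies
classes = ["40-49", "50-59", "60-69", "70-79"]
frequencies = [5, 12, 9, 4]

# Number of patients in each class or any class below it
less_than = list(accumulate(frequencies))
total = less_than[-1]

# Number of patients in each class or any class above it
or_more = [total - below for below in [0] + less_than[:-1]]

for name, below, above in zip(classes, less_than, or_more):
    print(f"{name}: {below} up to this class, {above} in this class or above")
```

Here `less_than[0]` counts the patients weighing less than 50 kg and `or_more[2]` counts those weighing 60 kg or more. Plotting the cumulative frequencies against the upper class boundaries gives the ogive.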
• Example: Heart rate of patients admitted in
hospital Y, 2013.
Box and Whisker plot
• It is another way to display information when the
objective is to illustrate certain locations in the
distribution.
• A box is drawn with the top of the box at the third
quartile and the bottom at the first quartile.
• The location of the median (the mid-point of the distribution) is
indicated with a horizontal line in the box.
• Finally, straight lines, or whiskers, are drawn from the
centre of the top of the box to the largest observation and
from the centre of the bottom of the box to the smallest
observation.
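The five values needed to draw such a plot can be computed as in the sketch below. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and the heart rates are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (smallest, first quartile, median, third quartile, largest)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


heart_rates = [60, 64, 68, 72, 76, 80, 84, 88, 92]  # hypothetical data
smallest, q1, median, q3, largest = five_number_summary(heart_rates)
print(f"box from {q1} to {q3}, line at {median}, "
      f"whiskers to {smallest} and {largest}")
```

The box spans the first to the third quartile, the line inside it marks the median, and the whiskers reach the smallest and largest observations.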
[Figure: A box and whisker diagram]
Scatter plot
• Most studies in medicine involve measuring more than
one characteristic, and graphs displaying the relationship
between two characteristics are common in the literature.
• When both the variables are qualitative then we can use a
multiple bar graph.
• When one of the characteristics is qualitative and the other
is quantitative, the data can be displayed in box and
whisker plots.
• To illustrate the relationship between two characteristics
when both are quantitative variables we use bivariate
plots (also called scatter plots or scatter diagrams).
Scatter plot
Line graph
Useful for assessing the trend of a particular situation over time.
Helps in monitoring the trend of epidemics.
The time, in weeks, months or years, is marked along the
horizontal axis.
Values of the quantity being studied are marked on the vertical
axis.
Values for each category are connected by a continuous line.
Sometimes two or more lines are drawn on the same graph
using the same scale so that the plotted lines are
comparable.
Example: Infant and under five mortality rate in Ethiopia, 1970-2005 (Tefera Darge
2011; EDHS, 2000, 2005)
        1970-75   1975-80   1980-85   1985-90   1990-95   1995-2000   2000-05
IMR       239      219.4     199.5      190       165        141        123
U5MR      160      138.8     127        104.8      95         83         77
[Graph 2.4: Line graph of infant and under five mortality rate in Ethiopia, 1970-2005, with series IMR and U5MR
(Tefera Darge 2011; EDHS, 2000, 2005)]
[Figure: No. of microscopically confirmed malaria cases by species and month at Zeway malaria control unit, 2003. Line graph of the number of confirmed malaria cases (vertical axis) by month, Jan to Dec (horizontal axis), with series: Positive, P. falciparum, P. vivax]
Line graph cont..
A line graph can also be used to depict the
relationship between two continuous
variables, like a scatter diagram.
The following graph shows the level of zidovudine
(AZT) in the blood of AIDS patients at several
times after administration of the drug, for patients with
normal fat absorption and with fat
malabsorption.
Line graph cont…..
[Figure: Response to administration of zidovudine in two groups of AIDS patients in hospital X, 1999. Line graph of blood zidovudine concentration (vertical axis) against time since administration in minutes (horizontal axis), with series: Fat malabsorption, Normal fat absorption]
Descriptive summary statistics
• Introduction
• In the previous section data were collected and
appropriately summarized into tables and charts.
• Next a variety of descriptive summary measures will be
developed.
• These descriptive measures are useful for analyzing and
interpreting quantitative data, whether collected in raw
form (ungrouped data) or summarized into frequency
distributions (grouped data)
Types of numerical descriptive measures
Four types of characteristics which describe a
data set pertaining to some numerical variable or
phenomenon of interest are:
1. Location
2. Dispersion
3. Relative standing
4. Shape
numerical descriptive measures
• In any analysis and/or interpretation of numerical data, a
variety of descriptive measures representing the properties
of location, variation, relative standing and shape may be
used to extract and summarize the salient features of the
data set.
• If these descriptive measures are computed from a sample
of data, they are called statistics. In contrast, if they
are computed from an entire
population of data, they are called parameters.
• Since statisticians usually take samples rather than use
entire populations, our primary emphasis deals with
statistics rather than parameters.
Measures of central tendency (MCT)
• On the scale of values of a variable there is a
certain point around which the largest number of items
tend to cluster.
• Since this point is usually in the centre of the
distribution, the tendency of statistical data to
concentrate at certain values is called “central
tendency”.
• The various methods of determining the actual
value at which the data tend to concentrate are
called measures of central tendency.
Measures of central tendency (MCT)
• The most important objective of calculating measure of
central tendency is to determine a single figure which may
be used to represent a whole series involving magnitude of
the same variable.
• In that sense it is an even more compact description of the
statistical data than the frequency distribution.
• Since a measure of central tendency represents the entire
data set, it facilitates comparison within one group or between
groups of data.
Characteristics of a good measure of central tendency
A measure of central tendency is good or satisfactory if it
possesses the following characteristics.
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the maximum number of values
as possible
4. It should have a definite value
5. It should not require complicated and tedious
calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling
Measures of location ( or central tendency)
A. Arithmetic Mean
a) Ungrouped data
Sample mean:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\text{Sum of the values of all observations in population}}{\text{Total number of observations in population}}$$
Arithmetic Mean
b) Grouped data
• In calculating the mean from grouped data, we assume
that all values falling into a particular class interval are
located at the mid-point of the interval. It is calculated as
follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where,
k = the number of class intervals
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
Example 1: If the marks of five medical students are 80, 75, 60, 50, 90, the mean
mark of the students is calculated as:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_5}{5} = \frac{80 + 75 + 60 + 50 + 90}{5} = 71$$
• Therefore, the mean mark of the students was 71.
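The calculation in Example 1 can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and are not part of the original notes.

```python
from statistics import mean

# Marks of the five medical students from Example 1
marks = [80, 75, 60, 50, 90]

# Sample mean: sum of the observations divided by their number
mean_by_hand = sum(marks) / len(marks)

# The standard-library function should agree with the hand calculation
mean_stdlib = mean(marks)

print(mean_by_hand, mean_stdlib)
```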
Exercise: You measure the body lengths (in inches) of 10 full-term infants at
birth and record the following: 17.5, 19.5, 17.5, 19, 20, 21, 18,
19.5, 18, 10.75. Compute the mean length of the infants for these
data.
Example 2: The ages of the patients diagnosed on a given day are given below.
Compute the mean age of the patients diagnosed that day.
Age of patients (years) 54 64 74 84 94 104
Number of patients 2 3 5 10 3 2
The mean age of the patients can be calculated as:
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_n f_n}{f_1 + f_2 + \dots + f_n}$$
$$\bar{X} = \frac{54(2) + 64(3) + 74(5) + 84(10) + 94(3) + 104(2)}{2 + 3 + 5 + 10 + 3 + 2} = \frac{2000}{25} = 80$$
Hence, the mean age of the patients that were diagnosed
on that day was 80 years.
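A frequency-weighted mean like the one in Example 2 is easy to recompute from the table. The sketch below assumes the table values exactly as printed; the variable names are illustrative.

```python
# Ages and number of patients, copied from the table in Example 2
ages = [54, 64, 74, 84, 94, 104]
counts = [2, 3, 5, 10, 3, 2]

# Denominator: total number of patients (sum of the frequencies)
total_patients = sum(counts)

# Numerator: each age multiplied by its frequency, then summed
weighted_sum = sum(age * f for age, f in zip(ages, counts))

mean_age = weighted_sum / total_patients
print(total_patients, weighted_sum, mean_age)
```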
The arithmetic mean properties.
• For given set of data there is one and only one arithmetic
mean.
• The arithmetic mean is easily understood and easy to
compute.
• Algebraic sum of the deviations of the given values from
their arithmetic mean is always zero.
• The arithmetic mean possesses all the characteristics of a
good measure of central tendency except No. 2: it is greatly
affected by extreme values.
• In case of grouped data if any class interval is open,
arithmetic mean can not be calculated
Mean Sensitive to Outliers
[Figure: histogram of nights of stay (horizontal axis: nights of stay, 0 to 50), marked with the two mean values 12.0 and 15.3]
Advantage and disadvantage of Mean
Advantage
• Mathematical center of a distribution.
• Just as far from scores above it as it is from scores below it.
• Good for interval and ratio data.
• Includes all the values of the data set and is unique.
• Inferential statistics is based on mathematical properties of the mean.
Disadvantage
• It is affected by extreme values and skewed distributions that are not
representative of the rest of the data.
• May not exist in the data.
Measures of location ( or central tendency)
• Example Consider 189 subjects: 48,35,…66.
By definition , the mean is calculated as:
= (48+35+…+66)/189 = 55.032
Median
Definition 3.2
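• The median m of a sample of n observations arranged in
ascending or descending order is the middle number that
divides the data set into two equal halves: one half of the
items lie above this point, and the other half lie below it.
a) Ungrouped data
$$\tilde{x} = \text{Median} = \begin{cases} x_k & \text{if } n = 2k - 1 \ (n \text{ is odd}) \\ \frac{1}{2}\left(x_k + x_{k+1}\right) & \text{if } n = 2k \ (n \text{ is even}) \end{cases}$$
Equivalently, with the data arranged in order:
$$\text{median}(X) = \begin{cases} \left(\frac{n+1}{2}\right)^{\text{th}} \text{ largest value}, & \text{when } n \text{ (size of the data) is odd} \\ \frac{1}{2}\left[\left(\frac{n}{2}\right)^{\text{th}} + \left(\frac{n+2}{2}\right)^{\text{th}}\right] \text{ value}, & \text{when } n \text{ is even} \end{cases}$$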
Example: To find the median of 6, 2, 7, 13, 4, 9, 15, 1, 12.
Arrange the data in increasing order: 1, 2, 4, 6, 7, 9, 12, 13, 15.
The sample size is n = 9 (odd), so the median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ largest value.
The median of the data becomes the $\left(\frac{9+1}{2}\right)^{\text{th}}$ largest value = the 5th value,
which is 7.
Exercise: Compute the sample median for the birth weight data:
3265, 3314, 2581, 2759, 2834, 2838, 2841, 3031, 3200, 3245, 3260, 3323,
4146, 3609, 3484, 3101, 3248, 2069, 3649, 3541.
Median
• In calculating the median from grouped data, we assume
that the values within a class‐interval are evenly
distributed through the interval.
• The first step is to locate the class interval in which the
median is located. We use the following procedure.
• Find n/2 and identify the class interval with the smallest
cumulative frequency that is at least n/2.
• To find a unique median value, use the following
interpolation formula:
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
• where,
Lm = lower true class boundary of the interval containing the
median
Fc = cumulative frequency of the interval just before the
median class interval
fm = frequency of the interval containing the median
W = class interval width
n = total number of observations
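The grouped-data procedure (find the median class, then interpolate within it) can be sketched as follows. The function name `grouped_median` and the frequency table are hypothetical, invented only to illustrate the steps; they do not come from the notes.

```python
def grouped_median(lower_boundaries, width, freqs):
    """Median of grouped data: Lm + ((n/2 - Fc) / fm) * W."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    half = n / 2
    cumulative = 0  # Fc: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= half:
            # First class whose cumulative frequency reaches n/2
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


# Hypothetical table: true lower class boundaries, common width 10
boundaries = [9.5, 19.5, 29.5, 39.5]
frequencies = [5, 8, 12, 5]

median_estimate = grouped_median(boundaries, 10, frequencies)
print(median_estimate)
```

Here n = 30, so n/2 = 15 falls in the third class (cumulative frequency 25), with Fc = 13 and fm = 12.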
Properties of the median
• There is only one median for a given set of data
• The median is easy to calculate
• Extreme values in data set do not affect the median as
strongly as they do the mean.
• Median can be calculated even in the case of open end
intervals
• It is not a good representative of data if the number of
items is small
Advantage and Disadvantage of Median
Advantage
• Not influenced by extreme scores or skewed distributions.
• Good for ordinal data.
• Easier to compute than the mean.
Disadvantage
• May not exist in the data.
• Doesn’t take actual values into account.
• Example 1 Find the median of the data set consisting of the observations
7, 4, 3, 5, 6, 8, 10.
Solution First, we arrange the data set in ascending order:
3 4 5 6 7 8 10.
Since the number of observations is odd, n = 2 x 4 - 1, the median is m = x4
= 6. We see that half of the observations, namely 3, 4, 5, lie below the
value 6 and the other half, namely 7, 8 and 10, lie
above the value 6.
Example 2 Suppose we have an even number of observations: 7, 4, 3,
5, 6, 8, 10, 1. Find the median of this data set.
Solution First, we arrange the data set in ascending order:
1 3 4 5 6 7 8 10.
Since the number of observations is n = 2 x 4, by Definition 3.2
Median = (x4+x5)/2 = (5+6)/2 = 5.5
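The odd and even cases of the median definition can be reproduced in code. The helper `median_by_rank` below is an illustrative name, not something defined in the notes; Python's own `statistics.median` is shown alongside for comparison.

```python
from statistics import median


def median_by_rank(values):
    """Median from the middle rank(s) of the sorted data."""
    if not values:
        raise ValueError("no observations")
    ordered = sorted(values)
    n = len(ordered)
    k = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value (0-based index k)
        return ordered[k]
    # Even n: average of the two middle values
    return (ordered[k - 1] + ordered[k]) / 2


odd_data = [7, 4, 3, 5, 6, 8, 10]       # Example 1
even_data = [7, 4, 3, 5, 6, 8, 10, 1]   # Example 2

print(median_by_rank(odd_data), median(odd_data))
print(median_by_rank(even_data), median(even_data))
```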
Mode
a) Ungrouped data
• The mode of a data set is the value that occurs with
the greatest frequency, i.e., is repeated most often in the
data set.
• If all the values are different there is no mode, on the
other hand, a set of values may have more than one
mode.
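A small sketch of the ungrouped-data rule, including the "no mode" and "more than one mode" cases. The function name `modes` and the three data sets are hypothetical illustrations.

```python
from collections import Counter


def modes(values):
    """All values with the highest frequency; [] when every value is distinct."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        # Convention used in these notes: all values different -> no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([2, 3, 3, 5, 7]))   # one value repeats most often
print(modes([1, 1, 2, 2, 3]))   # two values tie for the highest frequency
print(modes([4, 5, 6]))         # every value occurs once
```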
b) Grouped data
• In designating the mode of grouped data, we usually refer
to the modal class, where the modal class is the class
interval with the highest frequency.
• Mode = Lm + [A1/(A1 + A2)]W, where Lm = lower boundary of the modal class,
A1 = difference between the frequency of the modal class and the frequency of the
class immediately above it, A2 = difference between the frequency of the modal class
and the frequency of the class immediately below it, and W = width of the class
interval.
Properties of mode
• The mode can be used as a summary measure for
nominal, ordinal, discrete and continuous data, in
general however, it is more appropriate for nominal
and ordinal data.
• It is not affected by extreme values
• It can be calculated for distributions with open end
classes
• Often its value is not unique
• The main drawback of mode is that often it does not
exist
Advantage and Disadvantage of Mode
Advantage
• Good for nominal data.
• Like the median, the mode is not unduly affected by extreme values.
• We can use the mode no matter how large, how small, or how spread out the
values in the data set happen to be.
• Easiest to compute and understand.
Disadvantage
• The mode is not used as often to measure central tendency as are the mean
and the median.
• Too often, there is no modal value because the data set contains no values
that occur more than once.
• Ignores most of the information in a distribution.
• When data sets contain two, three, or many modes, they are difficult to
interpret and compare.
Example: Find the mode of the data set given below.
Geometric mean (GM)
• If x1, x2, ..., xn are n positive observed values, then
$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$
and
$$\log GM = \frac{1}{n} \sum_{i=1}^{n} \log x_i$$
• The geometric mean is generally used with data
measured on a logarithmic scale.
Harmonic mean (HM)
• Just as the geometric mean is based on an arithmetic
mean of logarithms, so is the harmonic mean based on
arithmetic mean of the reciprocals.
• We define it as the reciprocal of the arithmetic mean of
the reciprocal of the given numbers.
• If the given numbers are x1, x2, ..., xn, then
$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
Weighted mean (WM)
• If the given numbers are x1, x2, ..., xk and have
known weights w1, w2, ..., wk, then
$$\bar{x}_w = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$
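The three means described above (geometric, harmonic and weighted) can be sketched directly from their verbal definitions. The function names and the small data set are hypothetical, chosen only so the results are easy to check by hand.

```python
import math


def geometric_mean(xs):
    """Antilog of the arithmetic mean of the logarithms (positive values only)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean(xs):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(xs) / sum(1 / x for x in xs)


def weighted_mean(xs, ws):
    """Sum of weight times value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


values = [1, 2, 4]     # hypothetical observations
weights = [1, 1, 2]    # hypothetical weights

print(geometric_mean(values))          # cube root of 1*2*4
print(harmonic_mean(values))           # 3 / (1 + 1/2 + 1/4)
print(weighted_mean(values, weights))  # (1 + 2 + 8) / 4
```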
Comparing the Mean, Median and Mode
• In general, for a data set 3 measures of central tendency:
the mean , the median and the mode are different. For
example, for the data set on Age of 189 subjects, mean
=55.032, median = 54 and mode = 53.
• If all observations in a data set are arranged symmetrically
about an observation, then this observation is the mean, the
median and the mode.
• Which of these three measures of central tendency is
better? The best measure of central tendency for a data set
depends on the type of descriptive information you want.
Percentiles and Quartiles
• The quartiles are sets of values which divide the distribution
into four parts such that there are an equal number of
observations in each part.
– Q1 = [(n+1)/4]th
– Q2 = [2(n+1)/4]th
– Q3 = [3(n+1)/4]th
Percentiles and Quartiles
• Percentiles divide the data into 100 parts with an equal number of
observations in each part.
• It follows that the 25th percentile is the first quartile, the 50th
percentile is the median and the 75th percentile is the third
quartile.
Position of the percentile = p(n+1)th observation, where p is the required
percentile expressed as a proportion
Percentile Cont....
The pth percentile is a value such that p% of the
observations lie at or below it and the remaining (100-p)% lie at or above it.
The pth percentile is:
– The observation corresponding to p(n+1)th if
p(n+1) is an integer
– The average of (k)th and (k+1)th observations if
p(n+1) is not an integer, where k is the largest
integer less than p(n+1).
• If p(n+1) = 3.6, the average of 3rd and 4th observations.
• P50 = 50th percentile = Q2, P25 = 25th percentile = Q1
Example
Given a sample of size n = 60, find the 10th
percentile of the data set.
p(n+1) = 0.10(60+1) = 6.1
= Average of 6th and 7th
10% of the observations are less than or equal to this
value and 90% of them are greater than or equal to
the value
Exercise; Birth weight (gm) data for 20 infants
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101,
3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484,
3541, 3609, 3649, 4146
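Question
1. Compute the Q3, 10th and 90th percentiles
Answer
Q3 = 3(20+1)/4 = 15.75 = Average of 15th and 16th value =
(3323+3484)/2 = 3403.5gm
10th percentile = 0.1(20+1) = 2.1 = Average of 2nd and 3rd value =
(2581+2759)/2 = 2670gm
90th percentile = 0.9(20+1) = 18.9 = Average of 18th and 19th value
= (3609+3649)/2 = 3629gm
We estimate that 80% fall between 2670-3629gm
Alternatively, interpolating linearly between the two neighbouring values:
Q3 = 15th + 0.75(16th - 15th) = 3323 + 0.75(3484-3323) = 3443.75gm
10th percentile = 2nd + 0.1(3rd - 2nd) = 2581 + 0.1(2759-2581) = 2598.8gm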
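The p(n+1) rule can be written as a short function and applied to the birth weight data. This is a sketch under the assumption that the percentile is passed as a whole number (for example 10 for the 10th percentile); the function names are illustrative. A second helper shows the linear-interpolation variant, which gives slightly different values.

```python
from fractions import Fraction


def _position(n, percent):
    """Exact position p(n+1) with p = percent/100, split into whole and fractional parts."""
    pos = Fraction(percent * (n + 1), 100)
    k = pos.numerator // pos.denominator   # largest integer not exceeding pos
    frac = pos - k
    if k < 1 or k > n or (frac != 0 and k == n):
        raise ValueError("percentile position falls outside the data")
    return k, frac


def percentile_average(sorted_values, percent):
    """Rule from the notes: exact observation, or average of the kth and (k+1)th."""
    k, frac = _position(len(sorted_values), percent)
    if frac == 0:
        return sorted_values[k - 1]
    return (sorted_values[k - 1] + sorted_values[k]) / 2


def percentile_interpolated(sorted_values, percent):
    """Variant: kth value plus the fractional part of the gap to the (k+1)th."""
    k, frac = _position(len(sorted_values), percent)
    if frac == 0:
        return float(sorted_values[k - 1])
    lower, upper = sorted_values[k - 1], sorted_values[k]
    return float(lower + frac * (upper - lower))


# Birth weights (gm) of the 20 infants, already sorted
birth_weights = [2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

for pct in (10, 75, 90):
    print(pct, percentile_average(birth_weights, pct),
          percentile_interpolated(birth_weights, pct))
```

Statistical packages implement several percentile conventions, so results from software may differ slightly from either helper.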
Percentiles
• Simply divide the data into 100 pieces.
• Percentiles are not dependent on the distribution of the
data.
Using measures of central tendency
• Given a set of observations, an investigator may naturally
ask which measure of central tendency is best to use with
the data.
• Two factors are important in making this decision:
1. The scale of measurement
2. The shape of the distribution of observations
1. The arithmetic mean is used for interval and ratio data
and for symmetric distribution.
2. The median and quartiles are used for ordinal, interval
and ratio data whose distribution is skewed.
3. For nominal data mode is the appropriate MCT.
4. The geometric mean is used primarily for observations
measured on a logarithmic scale.
5. Harmonic mean is a suitable MCT when the data
pertains to rates and time.
6. Weighted mean is commonly used in the construction of
index number.
Measures of variability
• The measure of central tendency alone is not
enough to have a clear idea about the distribution
of the data.
• Moreover, two or more sets may have the same
mean and/or median but they may be quite
different.
• Thus to have a clear picture of data, one needs to
have a measure of dispersion or variability
(scatteredness) amongst observations in the set.
Measures of variability
• Reporting only an average without an accompanying measure
of variability may misrepresent a set of data.
– Two datasets can have the same average but very
different variability.
[Figure: Variation is important. A non-statistician drowning in a river of average depth 0.3 meter.]
Objectives of Measuring Variation
1. To judge the reliability of a measure of central tendency
2. To compare two or more sets of data with regard to their
variability
3. To control variability itself like in quality control, body
temperature, etc
4. To make further statistical analysis or to facilitate the use of
other statistical measures.
Range (R)
• R = xmax – xmin, where
xmax is the largest value and xmin is the smallest value.
Example: for the given data set: 100, 95, 125, 45, 70, the range is calculated
as:
R= xmax – xmin
R= 125 – 45
Range = 80.
Properties of Range
• Range and relative range are easy to calculate and simple to understand.
• Both cannot be computed for grouped data with open ended classes.
• They do not tell us anything about the distribution of values in the
series.
Exercise1: Find the range for the monthly salary of ten workers in a certain
health center given below. 462, 480, 534, 624, 498, 552,606, 588, 516,
570.
Interquartile range (IQR)
• IQR = Q3 ‐ Q1, where
Q3 is the third quartile and Q1 is the first quartile.
Example: Suppose the first and third quartiles for weights of
girls 12 months of age are 8.8 Kg and 10.2 Kg respectively.
The interquartile range is therefore
IQR = 10.2 Kg – 8.8 Kg = 1.4 Kg
i.e., the middle 50% of infant girls at 12 months weigh between
8.8 and 10.2 Kg.
Properties
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on two specific
values
• It is important in selecting cut‐off points in the formulation
of clinical standards
• Since it excludes the lowest and highest 25% values, it is
not affected by extreme values
• It is not capable of further algebraic treatment
Quartile deviation (QD)
QD = (Q3- Q1)/2
Coefficient of quartile deviation (CQD)
CQD=(Q3- Q1)/(Q3+Q1)
CQD is an absolute quantity (unitless) and is useful to
compare the variability among the middle 50%
observations
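The range and the quartile-based measures are simple enough to compute in a few lines. The sketch below reuses the range example (100, 95, 125, 45, 70) and the quartiles from the infant weight example (8.8 Kg and 10.2 Kg); the variable names are illustrative.

```python
# Range example from the notes
values = [100, 95, 125, 45, 70]
data_range = max(values) - min(values)

# Quartiles for weights of girls 12 months of age (Kg)
q1, q3 = 8.8, 10.2

iqr = q3 - q1                  # interquartile range, in Kg
qd = (q3 - q1) / 2             # quartile deviation, in Kg
cqd = (q3 - q1) / (q3 + q1)    # coefficient of quartile deviation, unitless

print(data_range, round(iqr, 2), round(qd, 2), round(cqd, 4))
```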
Variance and standard deviation
• A measure of dispersion relative to the scatter of the values
about their mean.
The population variance of a population of N
observations x1, x2, ..., xN is defined by the formula
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
• The variance is the average of the squares of the deviations
taken from the mean.
• The sum of squared deviations divided by the number of
deviations from the mean gives us the average
squared deviation, known as the variance.
Sample Variance
• For a sample, the sum of squared deviations from the sample mean is
divided by n‐1 (one less than the number of observations) to give the
sample variance
Why divide by n‐1
• Samples give us estimates of population parameters
(population mean and variance)
• Dividing by n underestimates the population variance and
this is easily demonstrated
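One way to demonstrate the underestimation is by simulation: draw many small samples from a known population and average the two versions of the variance. Everything below is a hypothetical setup (a simulated population of 2000 values, samples of size 5, a fixed random seed), not data from the notes.

```python
import random


def variance(values, divisor_offset):
    """Sum of squared deviations divided by (n - divisor_offset)."""
    n = len(values)
    centre = sum(values) / n
    return sum((x - centre) ** 2 for x in values) / (n - divisor_offset)


rng = random.Random(2019)  # fixed seed so the run is repeatable

# Hypothetical population with mean near 50 and SD near 10
population = [rng.gauss(50, 10) for _ in range(2000)]
population_variance = variance(population, 0)

sample_size, draws = 5, 5000
total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(draws):
    sample = rng.sample(population, sample_size)
    total_div_n += variance(sample, 0)           # divide by n
    total_div_n_minus_1 += variance(sample, 1)   # divide by n - 1

avg_div_n = total_div_n / draws
avg_div_n_minus_1 = total_div_n_minus_1 / draws

print(round(population_variance, 1), round(avg_div_n, 1), round(avg_div_n_minus_1, 1))
```

On average the divide-by-n version should come out near (n-1)/n of the population variance, while the divide-by-(n-1) version should land close to it.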
Another feature about n‐1
• In many statistical tests we sum variances from groups and
we lose a data point or what is sometimes referred to as
degrees of freedom.
• As noted already in order to make estimates from samples
to a population certain conditions have to be met.
• An additional one being that the sum of the deviation
scores around the mean must add up to zero.
• For each sample estimate we therefore lose a degree of
freedom – all numbers on which the estimate is based are
free to vary except one.
Variance and standard deviation
(A) Ungrouped data
• Let X1, X2, ..., XN be the measurements on N
population units, then
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
• The sample variance of the set x1, x2, ..., xn of n
observations is
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
(B) Grouped data
$$S^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1}$$
• where
mi = the mid‐point of the ith class interval
fi = the frequency of the ith class interval
$\bar{x}$ = the sample mean
k = the number of class intervals
Example: If the blood sugar level of small population is: 80, 70, 95, 100, 125.
Calculate the variance and standard deviation of the data.
Solution: As the data is collected from the population, the variance is calculated
using:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
But first the mean is calculated as:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{80 + 70 + 95 + 100 + 125}{5} = \frac{470}{5} = 94$$
To calculate the variance:
$$\sigma^2 = \frac{(80-94)^2 + (70-94)^2 + (95-94)^2 + (100-94)^2 + (125-94)^2}{5} = \frac{1770}{5} = 354$$
The standard deviation will be:
$$S.D = \sqrt{\text{variance}} = \sqrt{354} = 18.8$$
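The blood sugar example can be verified with Python's standard library, which provides population versions of the variance and standard deviation. A minimal sketch with illustrative variable names:

```python
from math import sqrt
from statistics import pstdev, pvariance

blood_sugar = [80, 70, 95, 100, 125]   # the whole (small) population

# Hand calculation: mean, then average squared deviation
mu = sum(blood_sugar) / len(blood_sugar)
sigma_squared = sum((x - mu) ** 2 for x in blood_sugar) / len(blood_sugar)
sigma = sqrt(sigma_squared)

# Standard-library equivalents (population formulas, divisor N)
print(mu, sigma_squared, round(sigma, 1))
print(pvariance(blood_sugar), round(pstdev(blood_sugar), 1))
```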
For grouped data with frequency, the population variance is
calculated as:
$$\sigma^2 = \frac{\sum f_i (X_i - \mu)^2}{N}, \quad N = \sum f_i$$
The standard deviation is the square root of the variance,
i.e. $S.D = \sqrt{\text{Variance}}$
• Example: In the study, the weight of six new born babies was recorded
below. Find the variance and S.D
Weight (Kg) 1.5 2.5 3
Frequency 2 3 1
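Solution: Before calculating the variance, the mean weight of the
babies is found:
$$\mu = \frac{\sum X_i f_i}{\sum f_i} = \frac{1.5(2) + 2.5(3) + 3(1)}{2 + 3 + 1} = \frac{13.5}{6} = 2.25$$
$$\sigma^2 = \frac{\sum f_i (X_i - \mu)^2}{N} = \frac{2(1.5 - 2.25)^2 + 3(2.5 - 2.25)^2 + 1(3 - 2.25)^2}{6} = \frac{1.125 + 0.1875 + 0.5625}{6} = \frac{1.875}{6} = 0.3125$$
• Hence, the weight variability (variance) of the new born babies is 0.3125
• And the standard deviation will be:
$$S.D = \sqrt{\text{variance}} = \sqrt{0.3125} = 0.559$$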
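For frequency data it is easy to forget to square the deviations before multiplying by the frequencies, so a quick recomputation is worthwhile. The sketch below works from the table of new born weights as printed; the variable names are illustrative.

```python
from math import sqrt

weights_kg = [1.5, 2.5, 3]
freqs = [2, 3, 1]

n_babies = sum(freqs)

# Frequency-weighted mean
mean_weight = sum(x * f for x, f in zip(weights_kg, freqs)) / n_babies

# Each squared deviation is multiplied by its frequency
squared_terms = [f * (x - mean_weight) ** 2 for x, f in zip(weights_kg, freqs)]
pop_variance = sum(squared_terms) / n_babies
pop_sd = sqrt(pop_variance)

print(mean_weight, squared_terms, pop_variance, round(pop_sd, 3))
```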
Sample Variance (S²)
For ungrouped data, sample variance is calculated using:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
where $\bar{X}$ is the sample mean and n is the total
number of observations in the sample.
• Note: for the sample data we divide by (n-1) instead of n as in the case of
population variance, as it gives a better and unbiased estimator of the
population variance.
• Sample Standard Deviation (S)
$$S.D = \sqrt{\text{variance}}$$
For grouped data the sample variance is calculated as:
$$S^2 = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{\sum_{i=1}^{n} f_i - 1}$$
Example: A sample of 6 children was taken from the population, with ages
of: 17, 18, 19, 20, 22, 24. Calculate:
A) the variance B) the standard deviation
First the sample mean is calculated as:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{17 + 18 + 19 + 20 + 22 + 24}{6} = \frac{120}{6} = 20$$
As a sample is considered, the variance is calculated
as:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(17-20)^2 + (18-20)^2 + (19-20)^2 + (20-20)^2 + (22-20)^2 + (24-20)^2}{6 - 1} = \frac{9 + 4 + 1 + 0 + 4 + 16}{5} = \frac{34}{5} = 6.8$$
The S.D can be calculated as
$$S.D = \sqrt{\text{variance}} = \sqrt{6.8} = 2.61$$
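The same example in Python, using the sample (n-1) formulas. In the standard library, `statistics.variance` and `statistics.stdev` use the n-1 divisor, unlike `pvariance` and `pstdev`. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

ages = [17, 18, 19, 20, 22, 24]   # sample of 6 children

n = len(ages)
sample_mean = sum(ages) / n

# Sample variance: squared deviations summed, divided by n - 1
sample_variance = sum((x - sample_mean) ** 2 for x in ages) / (n - 1)
sample_sd = sqrt(sample_variance)

print(sample_mean, sample_variance, round(sample_sd, 2))
print(variance(ages), round(stdev(ages), 2))
```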
Exercise: Calculate the variance and standard deviation for the following data.
1) 19, 20, 24, 12, 17, 22, 18, 20, 23, 17.
2)
Age Frequency
22 3
23 2
24 4
26 1
Answers:
          Q1        Q2
Mean      19.2      23.4
SD        3.489667  1.264911
Variance  12.17778  1.6
Properties
• The main demerit of variance is that its unit is the
square of the unit of measurement of the variate values
• The variance gives more weight to the extreme
values as compared to those which are near to the mean
value, because the difference is squared in variance.
• The drawbacks of variance are overcome by the standard
deviation.
Standard deviation (σ, S)
• It is the positive square root of the variance.
Properties
• Standard deviation is considered to be the best measure
of dispersion and is used widely because of the
properties of the theoretical normal curve.
• There is however one difficulty with it. If the units of
measurement of the variables of two series are not the
same, then their variability cannot be compared by
comparing the values of the standard deviation.
Coefficient of variation (CV)
• In situations where either two series have different units of
measurements, or their means differ sufficiently in size, the
coefficient of variation should be used as a measure of
dispersion.
• It is the best measure to compare the variability of two
series or sets of observations.
• A series with a smaller coefficient of variation is considered
more consistent.
• Coefficient of variation of a series of variate values is the
ratio of the standard deviation to the mean multiplied by
100:
$$CV = \frac{S}{\bar{x}} \times 100$$
• Example 3.6 Suppose that each day laboratory technician
A completes 40 analyses with a standard deviation of 5.
Technician B completes 160 analyses per day with a
standard deviation of 15. Which employee shows less
variability?
• At first glance, it appears that technician B has three times
more variation in the output rate than technician A. But B
completes analyses at a rate 4 times faster than A. Taking all
this information into account, we compute the coefficient of
variation for both technicians:
$$CV_A = \frac{5}{40} \times 100 = 12.5\%, \qquad CV_B = \frac{15}{160} \times 100 = 9.4\%$$
Relative to the mean output, technician B therefore shows less variability.
Example: In a count of red blood cells (RBC) per ml of plasma,
Abebe and Alemu get the following results. Which of the two lab technicians
performs the more reliable (consistent) measurement?
Laboratory technician   Abebe   Alemu
Mean count              79      64
Standard deviation      23      11
Solution:
Abebe: $CV = \frac{S}{\bar{x}} \times 100 = \frac{23}{79} \times 100 = 29.11\%$
Alemu: $CV = \frac{S}{\bar{x}} \times 100 = \frac{11}{64} \times 100 = 17.19\%$
• Interpretation: the measurement of Abebe has more variability (less
consistency) than Alemu’s measurement.
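Both coefficient of variation examples can be checked with one small helper. The function name `coefficient_of_variation` is illustrative; the inputs are the standard deviations and means quoted in the two examples.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


# Technicians A and B (analyses completed per day)
cv_a = coefficient_of_variation(5, 40)
cv_b = coefficient_of_variation(15, 160)

# Abebe and Alemu (RBC counts)
cv_abebe = coefficient_of_variation(23, 79)
cv_alemu = coefficient_of_variation(11, 64)

print(f"A: {cv_a:.1f}%  B: {cv_b:.1f}%")
print(f"Abebe: {cv_abebe:.1f}%  Alemu: {cv_alemu:.1f}%")
```

The smaller value in each pair identifies the more consistent series.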
Characteristics of a distribution
• The measure of central tendency and variation discussed before do not
reveal the entire story about frequency distributions.
• Two distributions may have the same mean and standard deviation but they
may differ in their shape of the distribution.
• Further description of their characteristics is necessary that is provided by
Skewness.
• In a symmetrical distribution the values of mean, median and mode are
alike. The term ‘Skewness’ refers to lack of symmetry or departure from
the symmetry.
• If extremely low or extremely high observations are present in a
distribution, then the mean tends to shift towards those scores.
Skewness
• The skewness of a distribution is measured by comparing the
relative positions of the mean, median and mode.
Distribution is symmetrical
Mean = Median = Mode
Distribution skewed right
Median lies between mode and mean, and mode is less than mean
Distribution skewed left
Median lies between mode and mean, and mode is greater than
mean
• Based on the type of skewness, distributions can be:
• a) Negatively skewed distribution: occurs when majority of scores are at
the right end of the curve and a few small scores are scattered at the left
end.
• b) Positively skewed distribution: Occurs when the majority of scores are
at the left end of the curve and a few extreme large scores are scattered at
the right end.
• c) Symmetrical distribution: It is neither positively nor negatively
skewed. A curve is symmetrical if one half of the curve is the mirror image
of the other half.
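A rough check of the direction of skew is to compare the mean with the median, since extreme values pull the mean towards them. The sketch below is only a heuristic under that assumption (equal mean and median do not prove symmetry); the function name and the three data sets are hypothetical.

```python
from statistics import mean, median


def skew_direction(values):
    """Compare mean and median as a quick indication of skewness."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive (right) skew"
    if m < med:
        return "negative (left) skew"
    return "no skew indicated"


print(skew_direction([1, 2, 2, 3, 10]))   # a few large values on the right
print(skew_direction([1, 8, 9, 9, 10]))   # a few small values on the left
print(skew_direction([1, 2, 3, 4, 5]))    # mean equals median
```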
Introduction to Probability
Objective
• To provide an understanding of probability and
its applications
• Calculation of probabilities using frequency
distribution
• Explain probability distribution and set the
ground for development of statistical inference
Introduction to sets
• A set is a collection of objects. Sets are usually designated
by capital letters A, B, . . . etc
Example: A = {a, b, c, d}. Here “a” is a member of set
“A”, which is denoted as a ∈ A.
• Universal set (U): the set of all objects under consideration.
• Empty/null set (∅): a set that contains no members.
• Given two sets A and B: if being a member of A implies being a
member of B, then A is a subset of B, denoted as A ⊆ B.
Introduction to sets
• Two sets A and B are equal if A & B have the same members.
• If A ∪ B = C, set C is A union B and contains elements
which are in A or in B or in both.
• If D = A ∩ B, set D is A intersection B and consists of
elements which are in A and in B.
• Example A = {1, 2, 3, 4, 5} B = {a, b, 1, 2, 5, c, 6}
• A ∪ B = {1, 2, 3, 4, 5, 6, a, b, c}
• A ∩ B = {1, 2, 5}
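Python's built-in `set` type mirrors these definitions, with `|` for union and `&` for intersection. The sketch below reproduces the example, writing the letters a, b, c as strings.

```python
A = {1, 2, 3, 4, 5}
B = {"a", "b", 1, 2, 5, "c", 6}

union = A | B          # elements in A or in B or in both
intersection = A & B   # elements in A and in B

print(union)
print(intersection)
print(3 in A, {1, 2} <= A)   # membership and subset tests
```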
Basic characteristics of Set
1. A ∪ ∅ = A, A ∩ ∅ = ∅, A ∪ U = U, A ∩ U = A
2. A ∪ A = A, A ∩ A = A
3. A ∪ B = B ∪ A; A ∩ B = B ∩ A
4. (A ∪ B) ∪ C = A ∪ (B ∪ C); (A ∩ B) ∩ C = A ∩ (B ∩ C)
5. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C);
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
6. (Aᶜ)ᶜ = A
7. (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ; (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
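These identities can be spot-checked on concrete sets. The universal set and the sets A, B, C below are hypothetical, and checking one example illustrates the laws rather than proving them.

```python
U = set(range(1, 11))   # hypothetical universal set {1, ..., 10}
A = {1, 2, 3, 4, 5}
B = {1, 2, 5, 6}
C = {2, 4, 6, 8}
EMPTY = set()


def complement(s):
    """Elements of the universal set that are not in s."""
    return U - s


checks = {
    "identity": (A | EMPTY == A and A & EMPTY == EMPTY
                 and A | U == U and A & U == A),
    "idempotent": A | A == A and A & A == A,
    "commutative": A | B == B | A and A & B == B & A,
    "associative": ((A | B) | C == A | (B | C)
                    and (A & B) & C == A & (B & C)),
    "distributive": (A | (B & C) == (A | B) & (A | C)
                     and A & (B | C) == (A & B) | (A & C)),
    "double complement": complement(complement(A)) == A,
    "De Morgan": (complement(A | B) == complement(A) & complement(B)
                  and complement(A & B) == complement(A) | complement(B)),
}

for name, holds in checks.items():
    print(name, holds)
```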
Probability
• Probability is the language of chance. The deliberate use of
chance is the central idea of statistical designs for producing data.
• Probability provides the necessary tools to capture the
uncertain state of our knowledge.
• A probabilistic experiment is any process that produces
outcomes which are not predictable in advance.
Probability
• Probabilities are used in everyday communication.
– A patient has a 50 – 50 chance of surviving a certain
operation
– The chance of a 30 year old woman to celebrate her 70th
birthday is 30%
• Because medicine is an inexact science, physicians seldom can
predict an outcome with absolute certainty.
• Example1
• To formulate a diagnosis, a physician must rely on available
diagnostic information about a patient;
– History and physical examination
– Laboratory studies, X‐ray findings, ECG, etc
Probability
• Because no test result is absolutely accurate, a result does not settle
the diagnosis; it changes the probability of the presence (or absence) of a
disease.
Example2
– We may hear a physician say that a patient has a 50—50 chance
of surviving a certain operation .
– Another physician may say that she is 95 percent certain that a
patient has a particular disease.
Probability
• understanding of probability is fundamental for
quantifying the uncertainty that is inherent in the decision
making process.
• Probability theory also allows us to draw conclusions
about a population of patients based on known information
about a sample of patients drawn from that population.
Basic terms
• A random experiment is an experiment for which the
outcome cannot be predicted with certainty, but all
possible outcomes can be identified prior to its
performance, and it may be repeated under the same
conditions.
• We call a phenomenon random if:-
– The exact outcome is not predictable in advance.
– however, there is a predictable long term pattern that can be
described by the distribution of outcomes of very many trials
Basic terms
• Sample space is the set of all possible outcomes of a
random experiment. It is denoted by S, and P(S) = 1.
• In tossing a single six-sided die once the sample space is
S = {1, 2, 3, 4, 5, 6} .
• Equally likely: A set of events is equally likely if one of
them cannot be expected to happen in preference to
another.
– E.g. when a fair coin is tossed, heads and tails are
equally likely outcomes.
Basic terms
• Mutually exclusive events: events are mutually exclusive if the
occurrence of one of them precludes the occurrence of all others.
Two events A and B are mutually exclusive if they cannot
occur at the same time
P (A ∩ B) = 0
Example:
o A coin toss cannot produce heads and tails
simultaneously.
o Weight of an individual can’t be classified
simultaneously as “underweight”, “normal”,
“overweight”
Basic terms
• Independent Events: Two events A and B are independent
if the probability of the first one happening is the same no
matter how the second one turns out.
The outcome of one event has no effect on the occurrence
or non-occurrence of the other.
Example:
The outcomes on the first and second coin tosses are
independent
Basic terms
• Experiment = any process with an uncertain outcome
– When an experiment is performed, one and only one
outcome is obtained.
• Event = something that may happen or not when the
experiment is performed
– An event either occurs or it does not occur.
– Events are represented by uppercase letters such as A, B, & C
Examples
1. Experiment is blood test to determine HIV status. Possible
outcomes are {HIV +} and {HIV -}.
– A1 could be the event that a test comes out positive.
– A2 could be the event that a test comes out negative.
2. Experiment is blood test and further screening to determine
HIV status (HIV+ or HIV-) and AIDS status (D+ or D-).
Events are:
– {(HIV +;D+)}; {(HIV +;D-)}; {(HIV -;D+)}; {(HIV -;D-)}
3. Experiment is to record the number of people that get tested for
HIV in one week at a given clinic. Suppose 500 is the maximum
possible number of tests given in a week. Then any non-negative
integer less than or equal to 500 is a conceivable outcome.
Events are:{0}; {1}; {2}; … ; {500}
• Note that unions and intersections of events are events.
A1 is the event that greater than 100 people get tested.
A2 is the event that fewer than 220 people get tested.
A3 is the event that greater than 100 people but fewer than 220
get tested.
• The probability of an event A, denoted by P(A), is in general
the chance that A will happen. But how do we measure the chance of
occurrence, i.e., how do we determine the probability of an event?
4. Consider a box containing 100 marbles, 90 of them red and
the other 10 blue.
If the question is: "What share of the marbles in the box is red?",
someone who saw the box's contents would answer
"90%."
But if the question is: "If I take one marble at random,
do you think I would have a red one?", the answer
would be "90% chance."
The first 90% represents a proportion; the second 90%
indicates a probability.
Approaches to probability
1. Subjective Probability: Definitions of probability as a
quantitative measure of the “degree of certainty” of the
observer of experiment.
2. Classical definition: Definitions that reduce the concept
of probability to the more primitive notion of “equal
likelihood”
3. Statistical definition: Definitions that take as their point
of departure the “relative frequency” of occurrence of the
event in a large number of trials.
Approaches to probability
1. Subjective probability: measures the confidence or a wish
that a particular individual has in the truth of a particular
proposition.
– E.g. If someone says that he is 95% certain that a cure for
AIDS will be discovered within 5 years, then he means that
Pr(discovery of a cure for AIDS within 5 years) = 95%.
• Although the subjective view of probability has enjoyed
increased attention over the years, it has not been fully
accepted by scientists.
Approaches to probability
2. The classical definition of probability:
The probability P(A) of an event A is equal to the number
of possible simple events (outcomes) favorable to A
divided by the total number of possible simple events of
the experiment:
P(A) = m / N
where m = number of simple events into which the event A
can be decomposed and N = total number of possible simple events.
Example 1. Consider the experiment of tossing a
balanced coin. P(H) = P(T) = 1/2.
Example 2. Consider the experiment of tossing a
balanced die. Let Dk be the event that k dots (k = 1, 2, 3, 4, 5, 6)
are observed on the upper face of the die. Therefore,
P(Dk) = 1/6 (k = 1, 2, 3, 4, 5, 6).
Let Dodd be the event that an odd number of dots is
observed and
Deven the event that an even number of dots is observed.
– We have P(Dodd) = 3/6 = 1/2 and P(Deven) = 3/6 = 1/2.
– Let A be the event that fewer than 6 dots are
observed; then P(A) = 5/6.
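The classical definition in the die example can be reproduced by counting favorable faces. This is a small sketch, assuming a balanced six-sided die; the function name `classical_probability` is illustrative.

```python
from fractions import Fraction

# The six equally likely simple events of a balanced die.
faces = [1, 2, 3, 4, 5, 6]

def classical_probability(event):
    """Favourable simple events divided by all simple events (m / N)."""
    m = sum(1 for k in faces if event(k))
    return Fraction(m, len(faces))

p_odd = classical_probability(lambda k: k % 2 == 1)
p_even = classical_probability(lambda k: k % 2 == 0)
p_less_than_6 = classical_probability(lambda k: k < 6)

print(p_odd, p_even, p_less_than_6)  # 1/2 1/2 5/6
```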
Approaches to probability
3. The statistical (relative frequency) definition of probability:
The absolute frequency f(A) of an event A in n trials is the
number of times A occurs, and the relative frequency of A in
these trials is:
P(A) = f(A) / n
Example 1. Suppose that of 158 people who attended a
dinner party, 99 were ill due to food poisoning. The
probability of illness for a person selected at random is
Pr (illness) = 99/158 = 0.63 or 63%
Example 2. The records of a certain health center showed
that out of 10000 smokers, 2940 developed lung cancer.
If one smoker is randomly selected from this group,
what is the probability that he will develop lung cancer?
Let L = the event that the smoker develops lung cancer.
P(L) = 2940/10000 = 0.294
Note: We will adopt the relative frequency interpretation
of probability, which says that the probability that an
event A occurs is equal to the proportion of times that
A occurs if we repeat the random experiment again and
again, indefinitely.
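The relative frequency interpretation can be illustrated by simulation. The sketch below assumes a fair coin (true probability 0.5) and uses a fixed seed for repeatability; both choices are illustrative and not part of the notes.

```python
import random

def relative_frequency(n_trials, seed=0):
    """Relative frequency of heads in n_trials simulated fair-coin tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_trials))
    return heads / n_trials

# As the number of trials grows, f(A)/n tends to settle near 0.5.
for n in (10, 100, 1_000, 100_000):
    print(n, relative_frequency(n))
```

With few trials the relative frequency can be far from 0.5; with many trials it stays close to it.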
Properties of probability
• The mathematical development of probability starts with
three basic rules or axioms:
1. The numerical value of a probability always lies between
0 and 1, inclusive: 0 ≤ P(E) ≤ 1
– A value of 0 means the event cannot occur
– A value of 1 means the event definitely will occur
– A value of 0.5 means that the probability that the event
will occur is the same as the probability that it will not
occur.
Properties of probability
2. The sum of the probabilities of all mutually
exclusive and exhaustive outcomes is equal to 1.
– P(E1) + P(E2 ) + .... + P(En ) = 1.
3. For any two events A and B P(A or B) is:
– P(A or B) = P(A) + P(B) - P(A and B) (Addition rule)
– For two mutually exclusive events A and B,
P(A or B ) = P(A) + P(B).
Properties of probability
4. For any two independent events A and B:
P(A and B) =P(A) P(B) (Multiplication rule)
5. The complement of an event A, denoted by Ā or Ac, is
the event that A does not occur; then P(Ac) = 1 − P(A)
(complementary events)
Basic Probability Rules
1. Addition rule
A. If events A and B are mutually exclusive:
P(A or B) = P(A) + P(B)
P(A and B) = 0
If not mutually exclusive:
P(A or B) = P(A) + P(B) - P(A and B)
P(event A or event B occurs or they both occur)
Example: The probabilities below represent years of
schooling completed by mothers of newborn infants
1. What is the probability that a
mother has completed < 12
years of schooling?
2. What is the probability that a
mother has completed 12 or
more years of schooling?
Class work
The probability that at least three individuals
among the five develop hepatitis B is
Basic Probability Rules
What is the probability that a mother has
completed < 12 years of schooling?
P(≤ 8 years) = 0.056 and
P(9-11 years) = 0.159
Since these two events are mutually exclusive,
P(≤ 8 or 9-11) = P(≤ 8 ∪ 9-11)
= P(≤ 8) + P(9-11) = 0.056 + 0.159
= 0.215
What is the probability that a mother has completed 12 or
more years of schooling?
P(≥ 12) = P(12 or 13-15 or ≥ 16) = P(12 ∪ 13-15 ∪ ≥ 16)
= P(12) + P(13-15) + P(≥ 16)
= 0.321 + 0.218 + 0.230
= 0.769
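The two sums in this solution apply the addition rule for mutually exclusive events. A minimal sketch follows, using only the probabilities quoted in the worked solution; the category labels "<=8" and ">=16" are an assumed reading of the schooling categories.

```python
# Probabilities quoted in the worked solution; the categories are
# mutually exclusive, so probabilities of unions are simple sums.
schooling = {
    "<=8": 0.056,    # label assumed: 8 years or fewer
    "9-11": 0.159,
    "12": 0.321,
    "13-15": 0.218,
    ">=16": 0.230,   # label assumed: 16 years or more
}

p_less_than_12 = schooling["<=8"] + schooling["9-11"]
p_12_or_more = schooling["12"] + schooling["13-15"] + schooling[">=16"]

print(round(p_less_than_12, 3), round(p_12_or_more, 3))  # 0.215 0.769
```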
Basic Probability Rules
B. If A and B are not mutually exclusive events,
then subtract the overlap:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Basic Probability Rules
2. Multiplication rule
If A and B are independent events, then
P(A ∩ B) = P(A) × P(B)
More generally, if dependent
P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
P(A and B) denotes the probability that A and B both
occur at the same time.
Conditional Probability
Refers to the probability of an event, given that another
event is known to have occurred.
“What happened first is assumed”
Hint - When thinking about conditional probabilities,
think in stages. Think of the two events A and B occurring
chronologically, one after the other, either in time or
space.
• Conditional probabilities are probabilities based on the
knowledge that some event has occurred.
Conditional Probability
• Conditional probabilities are denoted by P(B|A) (written
P(B/A) elsewhere in these notes), i.e., P(event | conditioning event).
• The conditional probability that event B occurs
given that event A has already occurred is denoted
P(B|A) and is defined as
P(B|A) = P(A ∩ B) / P(A),
provided that P(A) ≠ 0.
Conditional Probability
Example 1.
Table 1. A study investigating the effect of prolonged exposure to
bright light on retina damage in premature infants.
                 Retinopathy   Retinopathy   TOTAL
                 YES           NO
Bright light     18            3             21
Reduced light    21            18            39
TOTAL            39            21            60
Here D+ denotes retinopathy.
• Pr(D+ | reduced light) = Pr(D+ and reduced light) / Pr(reduced light)
= (21/60) / (39/60) = 21/39 = 54%
• Pr(D+ | bright light) = Pr(D+ and bright light) / Pr(bright light)
= (18/60) / (21/60) = 18/21 = 86%
Conditional Probability
• We want to know whether the probability of retinopathy
for the bright-light infants differs from the probability of
retinopathy for the reduced-light infants.
• That is, we want to compare the probability of retinopathy, given
that the infant was exposed to bright light, with the probability
given that the infant was exposed to reduced light.
• Exposure to bright light and exposure to reduced light are
conditioning events: events we want to take into account
when calculating conditional probabilities.
Conditional Probability
• For the retinopathy data, the conditional probability of
retinopathy, given exposure to bright light, is
P(Retinopathy | exposure to bright light)
= (No. of infants with retinopathy exposed to bright light)
  / (No. of infants exposed to bright light)
= 18/21 = 0.86
• P(Retinopathy | exposure to reduced light)
= (No. of infants with retinopathy exposed to reduced light)
  / (No. of infants exposed to reduced light)
= 21/39 = 0.54
• The conditional probabilities suggest that premature infants
exposed to bright light have a higher risk of retinopathy than
premature infants exposed to reduced light.
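The conditional probabilities for the retinopathy data can be recomputed from the table counts as joint probability divided by marginal probability. This is a minimal sketch; the dictionary layout and the function name `p_retinopathy_given` are illustrative.

```python
from fractions import Fraction

# Counts from Table 1: rows are light exposure, columns are retinopathy status.
counts = {
    "bright": {"yes": 18, "no": 3},
    "reduced": {"yes": 21, "no": 18},
}
total = sum(sum(row.values()) for row in counts.values())  # 60 infants

def p_retinopathy_given(light):
    """P(retinopathy | light) = P(retinopathy and light) / P(light)."""
    joint = Fraction(counts[light]["yes"], total)
    marginal = Fraction(sum(counts[light].values()), total)
    return joint / marginal

p_bright = p_retinopathy_given("bright")
p_reduced = p_retinopathy_given("reduced")

print(p_bright, round(float(p_bright), 2))    # 6/7 0.86
print(p_reduced, round(float(p_reduced), 2))  # 7/13 0.54
```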
Class work
Table 2 shows the frequency of cocaine use by gender
among adult cocaine users.
------------------------------------------------------------
Lifetime frequency
of cocaine use            Male    Female    Total
------------------------------------------------------------
1-19 times                32      7         39
20-99 times               18      20        38
more than 100 times       25      9         34
------------------------------------------------------------
Total                     75      36        111
------------------------------------------------------------
1. What is the probability that a randomly picked person is male?
2. What is the probability that a randomly picked person has used cocaine
more than 100 times?
3. Given that the selected person is male, what is the probability that he
has used cocaine more than 100 times?
4. Given that the person has used cocaine less than 100 times, what is the
probability that the person is female?
5. What is the probability that a randomly picked person is male and has
used cocaine more than 100 times?
Conditional Probability
1. For independent events A and B,
P(A|B) = P(A).
2. For non-independent events A and B,
P(A and B) = P(A|B) P(B) (General Multiplication Rule)
3. Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)
Conditional Probability
Home work
From a city population, the probability of selecting a male or a smoker
is 7/10, a male smoker is 2/5, and a male, if a smoker is already
selected, is 2/3. Find the probability of selecting (a) a non-smoker, (b)
a male, and (c) a smoker, if a male is first selected.
Let A: a male is selected
B: a smoker is selected. We are given
P(A ∪ B) = 7/10, P(A ∩ B) = 2/5, P(A|B) = 2/3
(a) The probability of selecting a non-smoker is
P(Bc) = 1 − P(B) = 1 − P(A ∩ B) / P(A|B)   [since P(A|B) = P(A ∩ B) / P(B)]
= 1 − (2/5)/(2/3) = 1 − 3/5 = 2/5
(b) The probability of selecting a male (by the addition theorem) is:
P(A) = P(A ∪ B) + P(A ∩ B) − P(B)
= (7/10) + (2/5) − (3/5) = 1/2
Class work
(c) Find the probability of selecting a smoker if a male is first selected:
P(B|A) = ?
Home work
1. Consider the experiment of tossing a fair die and
define the following events:
A = {Observe an even number of dots}
B = {Observe a number of dots less than or equal to 4}.
Are events A and B independent?
2. Suppose that three programmers are designing computer code for a
project: Mr. A has designed 60% of the code, Mr. B 30% and Mr. C
10%. Suppose further that Mr. A has a bug in 3% of his work, Mr. B
in 7% of his work, and Mr. C in 5% of his.
A. What percentage of the code written has a bug?
B. Given that you find a bug in a line of code, who is most likely to
have written it? Who is least likely?
C. How does the ordering compare to the unconditional probabilities
and why does this relationship make sense?
Bayes' Theorem
• In the health sciences field a widely used application of probability
laws and concepts is found in the evaluation of screening tests and
diagnostic criteria.
• Of interest to clinicians is an enhanced ability to correctly predict
the presence or absence of a particular disease from a knowledge of
test results (positive or negative) and/ or the status of presenting
symptoms (present or absent).
Bayes' Theorem
• Also of interest is information regarding the likelihood of
positive and negative test results and the likelihood of the
presence or absence of a particular symptom in patients
with and without a particular disease.
• In our consideration of screening tests, we must be aware
of the fact that they are not always perfect. That is, a
testing procedure may yield a false positive or a false
negative.
Bayes' Theorem
Total probability
If the event B may occur together with one and only one
of n mutually exclusive events A1, A2, ..., An, then
P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + ... + P(An)P(B|An).
Bayes' Formula
If the event B may occur together with one and only one
of n mutually exclusive events A1, A2, ..., An, then
$$P(A_k \mid B) = \frac{P(A_k)\,P(B \mid A_k)}{P(B)} = \frac{P(A_k)\,P(B \mid A_k)}{\sum_{j=1}^{n} P(A_j)\,P(B \mid A_j)}$$
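The total probability rule and Bayes' formula translate directly into two short functions. The sketch below uses hypothetical priors P(Ak) and likelihoods P(B|Ak) chosen only for illustration; they do not come from the notes.

```python
def total_probability(priors, likelihoods):
    """P(B) = sum over k of P(Ak) * P(B|Ak)."""
    return sum(p * l for p, l in zip(priors, likelihoods))

def bayes_posteriors(priors, likelihoods):
    """P(Ak|B) = P(Ak) * P(B|Ak) / P(B), for every k."""
    p_b = total_probability(priors, likelihoods)
    return [p * l / p_b for p, l in zip(priors, likelihoods)]

# Hypothetical values: three mutually exclusive events A1, A2, A3
# whose probabilities sum to 1, and P(B|Ak) for each of them.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.02, 0.04, 0.10]

p_b = total_probability(priors, likelihoods)
posteriors = bayes_posteriors(priors, likelihoods)

print(round(p_b, 3))                      # 0.042
print([round(p, 3) for p in posteriors])  # [0.238, 0.286, 0.476]
```

The posteriors always sum to 1, which is a quick sanity check on any hand calculation.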
Sensitivity and Specificity
• Data for assessing the sensitivity and specificity of a test are usually
of the form
                  Disease Category
Test result   Diseased (+)   Nondiseased (-)   Total
+             A              B                 A+B
-             C              D                 C+D
Total         A+C            B+D               A+B+C+D
Sensitivity: the proportion of diseased people who would
be correctly classified,
estimated by Sens = A/(A + C).
Specificity: the proportion of nondiseased people who
would be correctly classified,
estimated by Spec = D/(B + D).
Sensitivity and Specificity
• The prevalence of a disease is the percent of the population
with the disease estimated by R = (A + C)/(A + B + C + D).
Note that a random sample is required to estimate prevalence.
• Positive Predictive Value: is the proportion of people who
tested positive that truly are positive.
estimated by PPV =A/(A + B).
• Negative Predictive Value: is the proportion of people who
tested negative that truly are negative.
estimated by NPV =D/(C + D).
• False Negative: The probability of a false negative is the
probability of testing negative given a truly positive condition.
• False Positive: The probability of a false positive is the
probability of testing positive given a truly negative condition.
Example 1
Data for assessing the sensitivity and specificity of a test:
                  Disease Category
Test result   Diseased (+)   Nondiseased (-)   Total
+             10000          5000              15000
-             1000           84000             85000
Total         11000          89000             100000
The estimated sensitivity is Sens = A/(A + C) = 10000/11000 = 90.9%
The estimated specificity is Spec = D/(B + D) = 84000/89000 = 94.4%
The estimated prevalence is R = (A + C)/(A + B + C + D) = 11000/100000 = 11.0%
The estimated PPV is PPV = A/(A + B) = 10000/15000 = 66.7%
The estimated NPV is NPV = D/(C + D) = 84000/85000 = 98.8%
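The screening measures in this example can be recomputed from the four cell counts, and the PPV can be cross-checked with Bayes' theorem. This is a minimal sketch using the counts from Example 1; the variable names are illustrative.

```python
# Cell counts from Example 1:
# A = diseased and test +, B = nondiseased and test +,
# C = diseased and test -, D = nondiseased and test -.
A, B, C, D = 10_000, 5_000, 1_000, 84_000

sensitivity = A / (A + C)
specificity = D / (B + D)
prevalence = (A + C) / (A + B + C + D)
ppv = A / (A + B)
npv = D / (C + D)

# Bayes' theorem gives the same PPV from sensitivity,
# specificity and prevalence alone.
ppv_bayes = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)

for name, value in [
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Prevalence", prevalence),
    ("PPV", ppv),
    ("NPV", npv),
]:
    print(f"{name}: {value:.1%}")
```

The agreement between `ppv` and `ppv_bayes` shows that the predictive value depends on prevalence as well as on the test's own accuracy.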
PROBABILITY DISTRIBUTION
Probability distribution
• Every random variable has a corresponding probability
distribution.
• A probability distribution applies the theory of probability
to describe the behavior of the random variable.
• The term probability distribution, or just distribution, refers
to the way the values of a random variable are distributed; it is
used to draw conclusions about a set of data.
Probability distribution
• Probability distribution is listing of all the possible values
that a random variable can take along with their
probabilities.
• A probability distribution of a random variable can be
displayed by a table or a graph or a mathematical formula.
• Random Variable is any quantity or characteristic that is
able to assume a number of different values such that any
particular outcome is determined by chance
• Random variables can be either discrete or continuous
Example: tossing a fair coin three times
Sample space: HHH, HHT, HTH, THH, TTT, TTH, THT, HTT
X = number of heads    P(X = x)
0                      1/8
1                      3/8
2                      3/8
3                      1/8
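The probabilities 1/8, 3/8, 3/8, 1/8 can be obtained by listing the eight outcomes of three coin tosses and counting. The sketch below assumes a fair coin and takes the random variable to be the number of heads, which is an assumed reading of the listing above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 = 8 equally likely outcomes of three tosses of a fair coin.
outcomes = ["".join(t) for t in product("HT", repeat=3)]

# X = number of heads in each outcome (assumed definition of X).
head_counts = Counter(o.count("H") for o in outcomes)
distribution = {
    x: Fraction(n, len(outcomes)) for x, n in sorted(head_counts.items())
}

for x, p in distribution.items():
    print(x, p)
```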
Probability distribution
• The domain of a random variable is the sample space and its
range is a set of real numbers.
Example 1 The number of HIV+ results upon taking a single
blood test to determine HIV status.
Example 2 Observe 100 babies born in a clinic. The
number of boys born is a random
variable. It may take values from 0 to 100.
Example 3 Select one student from a university and
measure his/her height, recording this height as x. Then x
is a random variable, assuming values from, say, 100
cm to 250 cm depending upon the specific student.
Basic definition
A discrete random variable is able to assume only a finite or
countable number of outcomes
A continuous random variable can take on any value in a specified
interval.
Example 1 Experiment is surgery on two people. Outcomes are {ss,sf,fs,ff}.
Example2 Experiment is to observe the number of people that get tested for
HIV in one week at a given clinic. Suppose 500 is the maximum
possible number of tests given in a week. Then any non-negative
integer less than or equal to 500 is a conceivable outcome.
X = number of tests in a given week.
Example3 Experiment is to record the number of places that a person has
lived in his or her lifetime. Possible outcomes are {1; 2; 3; …,}
X = number of places a person has lived.
Example4 . Experiment is to record the sex of a person. Outcomes {m, f}
Discrete Probability distributions
• For a discrete random variable X, a probability
distribution is a function that assigns to any possible value
x of X the probability P(X = x).
Two Requirements for a Probability Distribution:
1. The sum of the probabilities of all the events in the
sample space must equal 1; that is
ΣP(X)=1.
2. The probabilities of each event in the sample space must
be between or equal to 0 and 1. That is, 0≤P(X)≤1.
Example1:
• Consider again the experiment of taking a single blood
test to determine HIV status. Let the random variable X
denote the number of positive tests.
• Then X(HIV+)=1, X(HIV-)=0
If we knew that the prevalence of HIV was 0.11, then
P(X = 1) = 0.11 and P(X = 0) = 0.89
• These two equations completely describe the probability
distribution of the discrete (dichotomous) random
variable X.
Example 2 Consider the value on the face showing
up from tossing a die.
• The probability distribution of this variable is
Value on Face 1 2 3 4 5 6
Probability 1/6 1/6 1/6 1/6 1/6 1/6
• Notice that the total probability is 1.
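• Example 3
The data show the number of diagnostic services
a patient receives, X, with the probabilities used in the
calculations that follow:
Number of services (x)   0      1      2      3      4      5
P(X = x)                 0.671  0.229  0.053  0.031  0.010  0.006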
• What is the probability that a patient receives exactly 3
diagnostic services?
P(X=3) = 0.031
• What is the probability that a patient receives at most one
diagnostic service?
P (X≤1) = P(X = 0) + P(X = 1)
= 0.671 + 0.229
= 0.900
• What is the probability that a patient receives at least four
diagnostic services?
P (X≥4) = P(X = 4) + P(X = 5)
= 0.010 + 0.006
= 0.016
Expected Value of a Discrete Random variable
• The average value assumed by a random variable is called
its expected value, or the population mean
• It is represented by E(X) or µ, where
µ = E(X) = Σ x·P(X = x).
Example expected value For the diagnostic service data:
Mean (X) = 0(0.671) +1(0.229) +2(0.053) +3(0.031) +4(0.010)
+5(0.006)
= 0.498 ≈ 0.5
• We would expect an average of 0.5 services for each visit
Variance of a Discrete Random Variable
• The variance of a random variable X is called the
population variance and is represented by Var(X) or σ²:
σ² = Σ (xi − µ)² P(X = xi)
The variance for the diagnostic service data above is
σ² = Σ (xi − µ)² P(X = xi) = (0 − 0.5)²(0.671) + (1 − 0.5)²(0.229)
+ (2 − 0.5)²(0.053) + (3 − 0.5)²(0.031) + (4 − 0.5)²(0.010)
+ (5 − 0.5)²(0.006) = 0.782
Standard deviation = σ = √0.782 = 0.884
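The mean, variance, and standard deviation of the diagnostic service data can be recomputed from the probabilities used in the worked example. This is a minimal sketch; it uses the unrounded mean (0.498) rather than 0.5, which gives the same results to three decimals.

```python
import math

# P(X = x) for the number of diagnostic services, as used in the example.
pmf = {0: 0.671, 1: 0.229, 2: 0.053, 3: 0.031, 4: 0.010, 5: 0.006}

mean = sum(x * p for x, p in pmf.items())                    # E(X)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X)
sd = math.sqrt(variance)

print(round(mean, 3), round(variance, 3), round(sd, 3))  # 0.498 0.782 0.884
```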
Factorials
• Given the positive integer n, the product of all the whole
numbers from n down through 1 is called n factorial and is
written n!.
• n! = n × (n−1) × (n−2) × … × 2 × 1 = n × (n−1)!
• By definition; 0!=1.
Factorials
• Permutation: An ordered arrangement of objects.
• Combinations: An arrangement of objects without
regard to order.
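Factorials, permutations, and combinations are available in Python's standard `math` module (`math.perm` and `math.comb` require Python 3.8 or later). The values n = 5 and r = 2 below are illustrative only.

```python
import math

n, r = 5, 2  # illustrative values: choose 2 objects out of 5

n_factorial = math.factorial(n)   # 5! = 120
permutations = math.perm(n, r)    # ordered arrangements: n! / (n - r)!
combinations = math.comb(n, r)    # unordered selections: n! / (r! (n - r)!)

print(n_factorial, permutations, combinations)  # 120 20 10
print(math.factorial(0))                        # 1, by definition
```

Each unordered selection of r objects corresponds to r! ordered arrangements, so the number of combinations equals the number of permutations divided by r!.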
Binomial distribution
• It is one of the most widely encountered discrete
distributions.
• The origin of binomial distribution lies in Bernoulli’s trials.
• When a single trial of some experiment can result in only
one of two mutually exclusive outcomes (success or
failure; dead or alive; sick or well; male or female), the trial
is called a Bernoulli trial.
Example1.
– Let X represents smoking status; X=1 smoker and X=0
non-smoker. The two outcomes are mutually exclusive.
– Take the case of USA; in 1987, 29% of the adults in USA
were smokers, therefore Pr (X=1) = 0.29 and Pr (X=0) =
1-0.29 = 0.71.
Binomial distribution
• Suppose a trial can have only two outcomes, A (success) and B (failure).
Pr(X = success) = Pr(X = 1) = p
• Pr(X = failure) = Pr(X = 0) = 1 − p
• If the experiment is repeated n times and the outcome is
independent from one trial to another, the probability
that a success occurs exactly x times is
$$\Pr(X = x) = \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{n-x}$$
where n (the number of trials) and p (the probability of success on each trial)
are the parameters of the binomial distribution, x is the number of
successes, and n! (read as "n factorial" or "factorial n") is the
product of all integers from 1 to n inclusive. By definition,
1! = 0! = 1.
Binomial distribution
Example 2
Suppose we now randomly select two individuals in the USA and record the
smoking status of each.
What is the probability that
– both are non-smokers?
– exactly one is a smoker?
– both are smokers?
If Pr(X=1) = p and Pr(X=0) = 1 − p, then these can be calculated
using the multiplication rule.
Outcome of X
Person 1   Person 2   Probability                              No. of smokers
0          0          (1 − p)(1 − p) = 0.71 × 0.71 = 0.50      0
0          1          (1 − p) p = 0.71 × 0.29 = 0.21           1
1          0          p (1 − p) = 0.29 × 0.71 = 0.21           1
1          1          p p = 0.29 × 0.29 = 0.08                 2
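The same probabilities follow from the binomial formula with n = 2 and p = 0.29. The sketch below is one possible implementation; the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr(X = x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 2, 0.29  # two people, 29% smokers (values from the example)

for x in range(n + 1):
    print(x, round(binomial_pmf(x, n, p), 4))
```

The probability of exactly one smoker, 0.4118, combines the two table rows with one smoker (0.71 × 0.29 counted in both orders).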
Characteristics of a Binomial Distribution
1. The experiment consists of n identical trials. There are
only two possible mutually exclusive outcomes, A and B, on each
trial.
2. The probability of A remains the same from trial to trial.
This probability is denoted by p, and the probability of B
is denoted by q. Note that q=1‐ p.
3. The trials are independent.
4. The binomial random variable X is the number of A’s in n
trials. n and p are the parameters of the binomial
distribution.
5. The mean is np and the variance is np(1‐ p)
The general form of the binomial pmf is given by
$$b(x; n, p) = \binom{n}{x}\, p^{x} q^{n-x}, \qquad q = 1 - p,$$
where $\binom{n}{x}$ is the number of combinations nCx, and its
cumulative distribution function (cdf) is given by
$$F(x) = B(x; n, p) = \sum_{i=0}^{x} b(i; n, p) = \sum_{i=0}^{x} \binom{n}{i}\, p^{i} q^{n-i}.$$
It is important to observe that the binomial random variable
X is the sum of n independent Bernoulli random variables Xi,
i.e., X = X1 + X2 + ... + Xn,
where Xi represents the Bernoulli random variable at the ith trial, whose value is
equal to 0 or 1 (0 for failure and 1 for success), so that the range of X is Rx =
{0, 1, 2, ..., n}.
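The cdf is a running sum of the pmf, and the stated mean np and variance np(1 − p) can be checked numerically. This is a sketch under assumed values (n = 10 is illustrative; p = 0.29 is the smoking proportion used earlier); the function names are not from the notes.

```python
from math import comb

def binomial_pmf(x, n, p):
    """b(x; n, p) = C(n, x) * p**x * q**(n - x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """F(x) = sum of b(i; n, p) for i = 0, ..., x."""
    return sum(binomial_pmf(i, n, p) for i in range(x + 1))

n, p = 10, 0.29  # n is an illustrative choice

mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(round(mean, 4), round(n * p, 4))                  # both equal np
print(round(variance, 4), round(n * p * (1 - p), 4))    # both equal np(1 - p)
print(round(binomial_cdf(n, n, p), 4))                  # cdf at x = n
```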
Class work 1
1. Each child born to a particular set of parents has a probability
of 0.25 of having blood type O. If these parents have 5
children. What is the probability that
a. Exactly two of them have blood type O
b. At most 2 have blood type O
c. At least 4 have blood type O
d. 2 do not have blood type O.
Class work 2
2. Suppose you take a sample of N independent biologists
to determine how many of them use valid statistical
methods.
• In particular, you have a sample of N independent,
identically distributed random variables Yi, with p = P(Yi = 1).
• What is the distribution of the number of successes
Y = Σ (i = 1 to N) Yi in N trials? Y ~ Bin(N, p)
• Calculate the probability that 0 out of 10 biologists use valid
statistical methods when the probability of using valid statistical
methods is 0.8.
The Poisson distribution
• The Poisson distribution is a discrete probability distribution used to model the
number of occurrences of an event that takes place
infrequently in time or space
• Applicable for counts of events over a given interval of
time, for example:
– number of patients arriving at an emergency
department in a day
– number of new cases of HIV diagnosed at a clinic in a
month
– Daily number of new cases of breast cancer notified
to a cancer registry
– Number of abnormal cells in a fixed area of
histological slides from a series of liver biopsies
The Poisson distribution
• The theoretical situation giving rise to data of this type is
easier to describe in relation to events occurring over
time (or space) at a fixed rate on average, but where each
event occurs independently and at random.
• Such data will have a Poisson distribution
• Suppose events happen randomly and independently in
time at a constant rate. If events happen with rate λ
events per unit time, the probability of x events
happening in unit time is:
$$P(X = x) = \frac{e^{-\lambda}\, \lambda^{x}}{x!}$$
• where x = 0, 1, 2, . . . is a potential outcome of X
• t is the time segment of interest; over a segment of length t,
λ is replaced by λt
• The constant λ (lambda) represents the rate at which
the event occurs, or the expected number of events
per unit time
• e = 2.71828
• The distribution depends on just one parameter, λ
Three assumptions of Poisson distribution
1. The probability that a single event occurs within a
given small subinterval is proportional to the
length of the subinterval
2. The rate at which the event occurs is constant over
the entire interval t
3. Events occurring in consecutive subintervals are
independent of each other
Example 1
The daily number of new registrations of cancer is 2.2 on average.
• What is the probability of
a) Getting no new cases
b) Getting 1 case
c) Getting 2 cases
d) Getting 3 cases
e) Getting 4 cases
Solution (Poisson distribution with λ = 2.2)
• a) P(X=0) = 0.111
• b) P(X=1) = 0.244
• c) P(X=2) = 0.268
• d) P(X=3) = 0.197
• e) P(X=4) = 0.108
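These five probabilities can be reproduced from the Poisson formula with λ = 2.2. This is a minimal sketch; the function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x! for a Poisson variable."""
    return exp(-lam) * lam**x / factorial(x)

lam = 2.2  # average daily number of new cancer registrations

for x in range(5):
    print(x, round(poisson_pmf(x, lam), 3))
```

The printed values are 0.111, 0.244, 0.268, 0.197, and 0.108, matching the solution.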
The Poisson distribution
• Characteristics;
• The Poisson distribution is very asymmetric when its mean
is small
• With large means it becomes nearly symmetric
• It has no theoretical maximum value, but the probabilities
tail off towards zero very quickly
• λ is the parameter of the Poisson distribution
• The mean is λ and the variance is also λ.
Probability distribution of continuous variables
• Under different circumstances, the outcome of a random
variable may not be limited to categories or counts.
Example 1
– Suppose, X represents the continuous variable
‘Height’; rarely is an individual exactly equal to 170cm
tall
– X can assume an infinite number of intermediate
values 170.1, 170.2, 170.3 etc.
• Because a continuous random variable X can take on an
uncountably infinite number of values, the probability
associated with any one particular value is
zero.
Probability distribution of continuous variables
• However, the probability that X will assume
some value in the interval bounded by two
values, say x1 and x2, can be greater than zero and is
given by
$$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx$$
where f(x) is the probability density function of X.
• As a continuous variable can take an infinite
number of values, it helps to visualize the
probability distribution as a curve and
probabilities as 'area under the curve'.
• The most widely used such curve is the normal distribution.
Normal Distribution
• The Normal Distribution is by far the most important
probability distribution in statistics.
• It is also sometimes known as the Gaussian distribution,
after the mathematician Gauss.
• The distributions of many medical measurements in
populations follow a normal distribution (eg. Serum uric
acid levels, cholesterol levels, blood pressure, height and
weight)
• The normal distribution is a theoretical, continuous
probability distribution whose equation is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$
for −∞ < x < +∞
Normal Distribution
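• The probability that a normal variable X falls in any given interval
between a and b is:
$$P(a \le X \le b) = \int_{a}^{b} \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\,dx$$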
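In practice the interval probability is obtained from the cumulative distribution function rather than by integrating by hand. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the mean of 170 cm and standard deviation of 10 cm are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical height distribution: mean 170 cm, standard deviation 10 cm.
height = NormalDist(mu=170, sigma=10)

def interval_probability(dist, a, b):
    """P(a <= X <= b) as the difference of two cdf values."""
    return dist.cdf(b) - dist.cdf(a)

p_160_to_180 = interval_probability(height, 160, 180)
density_at_mean = height.pdf(170)  # height of the curve, not a probability

print(round(p_160_to_180, 4))     # about 0.6827
print(round(density_at_mean, 4))  # about 0.0399
```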
Characteristics of the Normal Distribution
1. It is a probability distribution of a continuous variable. It
extends from minus infinity( -∞) to plus infinity (+∞).
2. It is unimodal, bell-shaped and symmetrical about x = μ.
3. It is determined by two parameters: referred as the mean μ
(read as ‘mu’) and standard deviation σ (read ‘sigma’).
– Changing μ alone shifts the entire normal curve to the left or
right.
– Changing σ alone changes the degree to which the distribution
is spread out.
– The mean μ can be any number (negative, positive or zero).
– The standard deviation σ must be a positive number.
Characteristics of the Normal Distribution
4. The height of the frequency curve, which is called the
probability density, cannot be taken as the probability of a
particular value.
– This is because for a continuous variable there are infinitely
many possible values so that the probability of any specific
value is zero.
5. An observation from a normal distribution can be related to a
standard normal distribution: (SND) which has a published
table.
– Thus an observation x from a normal distribution with
mean μ and standard deviation σ can be related to a
Standard normal distribution by calculating :
SND = Z = (x - μ ) / σ
6. Areas under the curve between perpendiculars erected at:
– μ ± 1σ contain about 68%;
– μ ± 2σ contain about 95%;
– μ ± 3σ contain about 99.7%
7. The distribution is completely determined by
the parameters μ and σ.
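The 68%, 95%, and 99.7% figures can be verified with the standard normal distribution, since standardizing with Z = (x − μ)/σ turns "within k standard deviations of the mean" into "−k ≤ Z ≤ k". This is a minimal sketch using `statistics.NormalDist`; the helper `z_score` and its example inputs are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_score(x, mu, sigma):
    """Standardize an observation: Z = (x - mu) / sigma."""
    return (x - mu) / sigma

# Area under the standard normal curve within k standard deviations.
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(k, round(area, 4))

print(z_score(180, 170, 10))  # hypothetical observation: 1.0
```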
Normal curve
Normal probability
• Normal curve area for Z value of 1.95 in the table
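A table lookup for Z = 1.95 can be cross-checked in code. Printed tables differ in which area they report, so the sketch below shows three common conventions; which one the table in these notes uses is not stated, so treat the mapping as an assumption.

```python
from statistics import NormalDist

z = 1.95
std_normal = NormalDist()

area_below = std_normal.cdf(z)       # P(Z <= 1.95), cumulative area
area_above = 1 - area_below          # P(Z > 1.95), upper-tail area
area_from_zero = area_below - 0.5    # P(0 <= Z <= 1.95)

print(round(area_below, 4))      # 0.9744
print(round(area_above, 4))      # 0.0256
print(round(area_from_zero, 4))  # 0.4744
```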