SM_CMT 05101 Epidemiology and Biostatistics

UNITED REPUBLIC OF TANZANIA
Ministry of Health and Social Welfare
CMT 05101
Epidemiology and
Biostatistics
NTA Level 5 Semester 1
Student Manual
August 2010
Copyright © Ministry of Health and Social Welfare – Tanzania 2010
Table of Contents
Background and Acknowledgement
Introduction
Abbreviations
Module Sessions
Session 1: Introduction to Biostatistics and Methods for Qualitative Data
Session 2: Descriptive Methods for Quantitative Data
Session 3: The Normal Distribution
Session 4: Sampling Techniques
Session 5: Estimation of Mean and Proportion
Session 6: Significance Tests of One Sample
Session 7: Chi-Square (χ²) Test
Session 8: Source and Uses of Morbidity and Mortality Statistics
Session 9: Introduction to Epidemiology
Session 10: Ecology and Epidemiological Approach to Causation
Session 11: Natural History and Levels of Prevention of Diseases
Session 12: Introduction to Epidemiological Methods/Studies
Session 13: Case-Control Studies
Session 14: Cohort Studies
Session 15: Testing and Screening of a Disease
Session 16: Control of Epidemics
Session 17: Integrated Disease Surveillance and Response
Session 18: Planning for Disease Prevention and Control
Background and Acknowledgement
In April 2009, a planning meeting was held at Kibaha. It was followed by a Task Force
Committee meeting at Dodoma in June 2009, which developed a proposal to guide the
development of standardised Clinical Assistant (CA) and Clinical Officer (CO) training
materials based on the CA/CO curricula. The purpose of this process was to standardize
the entire curriculum with up-to-date content, which would then be provided to all
Clinical Assistant and Clinical Officer Training Centres (CATCs/COTCs).
The perceived benefit was that, by standardizing the quality of content and integrating
interactive teaching methodologies, students would be able to learn more effectively and that
the assessment of students’ learning would have more uniformity and validity across all
schools.
In September 2009, MOHSW embarked on an innovative approach of developing the
standardised training materials through the Writer’s Workshop (WW) model. The model
included a series of three-week workshops in which pre-service tutors and content experts
developed training materials, guided by facilitators with expertise in instructional design and
curriculum development. The goals of WW were to develop high-quality, standardized
teaching materials and to build the capacity of tutors to develop these materials.
The new training package for CA/CO cadres includes a Facilitator Guide, Student Manual
and Practicum. There are 40 modules with approximately 600 content sessions. This product
is a result of a lengthy collaborative process, with significant input from key stakeholders and
experts of different organizations and institutions, from within and outside the country.
The MOHSW would like to thank all those involved during the process for their valuable
contribution to the development of these materials for CA/CO cadres. We would first like to
thank the U.S. Centers for Disease Control and Prevention’s Global AIDS Program
(CDC/GAP) Tanzania, and the International Training and Education Center for Health
(I-TECH) for their financial and technical support throughout the process. At CDC/GAP, we
would like to thank Ms. Suzzane McQueen and Ms. Angela Makota for their support and
guidance. At I-TECH, we would especially like to acknowledge Ms. Alyson Shumays,
Country Program Manager, Dr. Flavian Magari, Country Director, Mr. Tumaini Charles,
Deputy Country Director, and Ms. Susan Clark, Health Systems Director. The MOHSW
would also like to thank the World Health Organization (WHO) for technical and financial
support in the development process.
Particular thanks are due to those who led this important process: Dr. Bumi L.A.
Mwamasage, the Assistant Director for Allied Health Sciences Training, Dr. Mabula Ndimila
and Mr. Dennis Busuguli, Coordinators of Allied Health Sciences Training, Ministry of
Health and Social Welfare, Dr. Stella Kasindi Mwita, Programme Officer Integrated
Management of Adults and Adolescent Illnesses (IMAI), WHO Tanzania and Stella M.
Mpanda, Pre-service Programme Manager, I-TECH.
Sincere gratitude is expressed to small group facilitators: Dr. Otilia Gowele, Principal, Kilosa
COTC, Dr. Violet Kiango, Tutor, Kibaha COTC, Ms. Stephanie Smith, Ms. Stephanie
Askins, Julie Stein, Ms. Maureen Sarewitz, Mr. Golden Masika, Ms. Kanisia Ignas, Ms.
Yovitha Mrina and Mr. Nicholous Dampu, all of I-TECH, for their tireless efforts in guiding
participants and content experts through the process. A special note of thanks also goes to
Dr. Julius Charles and Dr. Moses Bateganya, I-TECH’s Clinical Advisors, and other Clinical
Advisors who provided input. We also thank individual content experts from different
departments of the MOHSW and other governmental and non-governmental organizations,
including EngenderHealth, Jhpiego and AIHA, for their technical guidance.
Special thanks go to a team of I-TECH staff, namely Ms. Lauren Dunnington, Ms.
Stephanie Askins, Ms. Stephanie Smith, Ms Aisling Underwood, Golden Masika, Yovitha
Mrina, Kanisia Ignas, Nicholous Dampu, Michael Stockman and Stella M. Mpanda for
finalising the editing, formatting and compilation of the modules.
Finally, we very much appreciate the contributions of the tutors and content experts
representing the CATCs/COTCs, various hospitals, universities, and other health training
institutions. Their participation in meetings and workshops, and their input in the
development of content for each of the modules have been invaluable. It is the commitment
of these busy clinicians and teachers that has made this product possible.
These participants are listed with our gratitude below:
Tutors
Ms. Magdalena M. Bulegeya – Tutor, Kilosa COTC
Mr. Pius J.Mashimba – Tutor, Kibaha Clinical Officers Training Centre (COTC)
Dr. Naushad Rattansi – Tutor, Kibaha COTC
Dr. Salla Salustian – Principal, Songea CATC
Dr. Kelly Msafiri – Principal, Sumbawanga CATC
Dr. Joseph Mapunda - Tutor, Songea CATC
Dr. Beda B. Hamis – Tutor, Mafinga COTC
Col Dr. Josiah Mekere – Principal, Lugalo Military Medical School
Mr. Charles Kahurananga – Tutor, Kigoma CATC
Dr. Ernest S. Kalimenze – Tutor, Sengerema COTC
Dr. Lucheri Efraim – Tutor, Kilosa COTC
Dr. Kevin Nyakimori – Tutor, Sumbawanga CATC
Mr. John Mpiluka – Tutor, Mvumi COTC
Mr. Gerald N. Mngóngó –Tutor, Kilosa COTC
Dr. Tito M. Shengena –Tutor, Mtwara COTC
Dr. Fadhili Lyimo – Tutor, Kilosa COTC
Dr. James William Nasson– Tutor, Kilosa COTC
Dr. Titus Mlingwa – Tutor, Kigoma CATC
Dr. Rex F. Mwakipiti – Principal, Musoma CATC
Dr. Wilson Kitinya - Principal, Masasi Clinical Assistants Training Centre (CATC)
Ms. Johari A. Said – Tutor, Masasi CATC
Dr. Godwin H. Katisa – Tutor, Tanga Assistant Medical Officers Training Centre (AMOTC)
Dr. Lautfred Bond Mtani – Principal, Sengerema COTC
Ms Pamela Henry Meena – Tutor, Kibaha COTC
Dr. Fidelis Amon Ruanda – Tutor, Mbeya AMOTC
Dr. Cosmas C. Chacha – Tutor, Mbeya AMOTC
Dr. Ignatus Mosten – Ag. Principal, Tanga AMOTC
Dr. Muhidini Mbata – Tutor, Mafinga COTC
Dr. Simon Haule – Ag. Principal, Kibaha COTC
Ms. Juliana Lufulenge - Tutor, Kilosa COTC
Dr. Peter Kiula – Tutor, Songea CATC
Mr. Hassan Msemo – Tutor, Kibaha COTC
Dr. Sangare Antony –Tutor, Mbeya AMOTC
Content Experts
Ms. Emily Nyakiha – Principal, Bugando Nursing School, Mwanza
Mr. Gustav Moyo - Registrar, Tanganyika Nurses and Midwives Council, Ministry of Health
and Social Welfare (MOHSW).
Dr. Kohelet H. Winani - Reproductive and Child Health Services, MOHSW
Mr. Hussein M. Lugendo – Principal, Vector Control Training Centre (VCTC), Muheza
Dr. Elias Massau Kwesi - Public Health Specialist, Head of Unit Health Systems Research
and Survey, MOHSW
Dr. William John Muller - Pathologist, Muhimbili National Hospital (MNH)
Mr. Desire Gaspered - Computer Analyst, Institute of Finance Management (IFM), Dar es
Salaam
Mrs. Husna Rajabu - Health Education Officer, MOHSW
Mr. Zakayo Simon - Registered Nurse and Tutor, Public Health Nursing School (PHNS)
Morogoro
Dr. Ewaldo Vitus Komba - Lecturer, Department of Internal Medicine, Muhimbili University
of Health and Allied Sciences School (MUHAS)
Mrs. Asteria L.M. Ndomba - Assistant Lecturer, School of Nursing, MUHAS
Mrs. Zebina Msumi - Training Officer, Expanded Programme on Immunization (EPI),
MOHSW
Mr. Lister E. Matonya - Health Officer, School of Environmental Health Sciences (SEHS),
Ngudu, Mwanza.
Dr. Joyceline Kaganda - Nutritionist, Tanzania Food and Nutrition Centre (TFNC),
MOHSW.
Dr. Suleiman C. Mtani - Obstetrician and Gynecologist, Director, Mwananyamala Hospital,
Dar es Salaam
Mr. Brown D. Karanja - Pharmacist, Lugalo Military Hospital
Mr. Muhsin Idd Nyanyam - Tutor, Primary Health Care Institute (PHCI), Iringa
Dr. Judith Mwende - Ophthalmologist, MNH
Dr. Paul Marealle - Orthopaedic and Traumatic Surgeon, Muhimbili Orthopedic Institute
(MOI),
Dr. Erasmus Mndeme - Psychiatrist, Mirembe Referral Hospital
Mrs. Bridget Shirima - Nurse Tutor (Midwifery), Kilimanjaro Christian Medical Centre
(KCMC)
Dr. Angelo Nyamtema - Tutor Tanzania Training Centre for International Health (TTCIH),
Ifakara.
Ms. Vumilia B. E. Mmari - Nurse Tutor (Reproductive Health) MNH-School of Nursing
Dr. David Kihwele - Obs/Gynae Specialist, and Consultant
Dr. Amos Mwakigonja – Pathologist and Lecturer, Department of Morbid Anatomy and
Histopathology, MUHAS
Mr. Claud J. Kumalija - Statistician and Head, Health Management Information System
(HMIS), MOHSW
Ms. Eva Muro, Lecturer and Pharmacist, Head Pharmacy Department, KCMC
Dr. Ibrahim Maduhu - Paediatrician, EPI/MOHSW
Dr. Merida Makia - Lecturer Head, Department of Surgery, MNH
Dr. Gabriel S. Mhidze - ENT Surgeon, Lugalo Military Hospital
Dr. Sira Owibingire - Lecturer, Dental School, MUHAS
Mr. Issai Seng’enge - Lecturer (Health Promotion), University of Dar es Salaam (UDSM)
Prof. Charles Kihamia - Professor, Parasitology and Entomology, MUHAS
Mr. Benard Konga - Economist, MOHSW
Dr. Martha Kisanga - Field Officer Manager, EngenderHealth, Dar es Salaam
Dr. Omary Salehe - Consultant Physician, Mbeya Referral Hospital
Ms Yasinta Kisisiwe - Principal Nursing Officer, Health Education Unit (HEU), MOHSW
Dr. Levina Msuya - Paediatrician and Principal, Assistant Medical Officers Training Centre
(AMOTC), Kilimanjaro Christian Medical Centre (KCMC)
Dr. Mohamed Ali - Epidemiologist, MOHSW
Mr. Fikiri Mazige - Tutor, PHCI-Iringa
Mr. Salum Ramadhani - Lecturer, Institute of Finance Management
Ms. Grace Chuwa - Regional RCH Coordinator, Coastal Region
Mr. Shija Ganai - Health Education Officer, Regional Hospital, Kigoma
Dr. Emmanuel Suluba - Assistant Lecturer, Anatomy and Histology Department, MUHAS
Mr. Mdoe Ibrahim - Tutor, KCMC Health Records Technician Training Centre
Mr. Sunny Kiluvia - Health Communication Consultant, Dar es Salaam
Dr. Nkundwe Gallen Mwakyusa - Ophthalmologist, MOHSW
Dr. Nicodemus Ezekiel Mgalula -Dentist, Principal Dental Training School, Tanga
Mrs. Violet Peter Msolwa - Registered Nurse Midwife, Programme Officer, National AIDS
Control Programme (NACP), MOHSW
Dr. Wilbert Bunini Manyilizu - Lecturer, Mzumbe University, Morogoro
Editorial Review Team

Dr. Kasanga G. Mkambu - Obstetrics and Gynaecology specialist, Tanga Assistant Medical
Officers Training Centre (AMOTC)
Dr. Ronald Erasto Msangi - Principal, Bumbuli COTC
Mr. Sita M. Lusana - Tutor, Tanga Environmental Health Science Training Centre
Mr. Ignas Mwamsigala - Tutor (Entrepreneurship) RVTC Tanga
Mr. January Karungula - RN, Quality Improvement Advisor, Muhimbili National Hospital
Prof. Pauline Mella - Registered Nurse and Professor, Hubert Kairuki Memorial University
Dr. Emmanuel A. Mnkeni – Medical Officer and Tutor, Kilosa COTC
Dr. Ronald E. Msangi - Principal, Bumbuli COTC
Mr. Dickson Mtalitinya - Pharmacist, Deputy Principal, St Luke Foundation, Kilimanjaro
School of Pharmacy
Dr. Janeth C. Njau - Paediatrician/Tutor, Kibaha COTC
Mr. Fidelis Mgohamwende - Laboratory Technologist, Programme Officer National Malaria
Control Programme (NMCP), MOHSW
Mr. Gasper P. Ngeleja - Computer Instructor, RVTC Tanga
Dr. Shubis M Kafuruki - Research Scientist, Ifakara Health Institute, Bagamoyo
Dr. Andrew Isack Lwali - Director, Tumbi Hospital
Librarians and Secretaries

Mr. Christom Aron Mwambungu - Librarian MUHAS
Ms. Juliana Rutta - Librarian MOHSW
Mr. Hussein Haruna - Librarian, MOHSW
Ms. Perpetua Yusufu - Secretary, MOHSW
Mrs. Martina G. Mturano -Secretary, MUHAS
Mrs. Mary F. Kawau - Secretary, MOHSW
IT support
Mr. Isaac Urio - IT Consultant, I-TECH
Mr. Michael Fumbuka - Computer Systems Administrator – Institute of Finance and
Management (IFM), Dar es Salaam
Dr. Gilbert Mliga

Director of Human Resources Development, Ministry of Health and Social Welfare
Introduction
Module Overview
This module content has been prepared to enhance learning of students of Clinical Assistant
(CA) and Clinical Officer (CO) schools. The session contents are based on the sub-enabling
outcomes of the curricula of CA and CO. The module sub-enabling outcomes are as follows:
6.1.1 Differentiate determinants of health and diseases of public health importance
6.1.2 Apply epidemiological methods in assessing distribution of health and diseases
6.1.3 Describe disease causation, prevention and control
6.1.4 Utilise concept of epidemic control measures and disaster preparedness
6.1.5 Utilise tools for gathering epidemiological data
6.2.1 Describe different types and sources of health information and biostatistics
6.2.2 Utilise different methods of data collection
Who is the Module For?

This module is intended for use primarily by students of CA and CO schools. The module’s
sessions give guidance on the contents and activities of each session and on how students
should follow the tutor when he/she teaches the module. They also provide the guidance and
information students need to read the materials on their own. The sessions also include
different activities which focus on increasing students’ knowledge, skills and attitudes.
How is the Module Organized?

The module is divided into 18 sessions; each session is divided into several sections. The
following are the sections of each session:
• Session Title: The name of the session.
• Learning Objectives – Statements which indicate what the student is expected to have
learned at the end of the session.
• Session Content – All the session contents are divided into subtitles. This section
includes contents and activities with their instructions to be done during learning of the
contents.
• Key Points – Each session has a step which concludes the session contents near the end
of a session. This step summarizes the main points and ideas from the session.
• Evaluation – The last section of the session consists of short questions based on the
learning objectives to check if you understood the contents of the session. The tutor will
ask you as a class to respond to these questions; however if you read the session by
yourself try answering these questions to evaluate yourself if you understood the session.
• Handouts – Additional information which can be used in the classroom while the tutor is
teaching or later for your further learning. Handouts are used to provide extra information
related to the session topic that cannot fit into the session time. Handouts can be used by
the students to study material on their own and to reference after the session. Sometimes,
a handout will have questions or an exercise for students to answer.
How Should the Module be Used?

Students are expected to use the module in the classroom and clinical settings and during self
study. The contents of the modules are the basis for learning Epidemiology and Biostatistics.
Students are therefore advised to learn all the sessions including all relevant handouts and
worksheets during class hours, clinical hours and self study time. Tutors are there to provide
guidance and to respond to any difficulties encountered by students. One module will be
assigned to 5 students and it is the responsibility of the tutor to do this assignment for easy
use and accessibility of the student manuals to students.
Abbreviations
AIDS Acquired Immunodeficiency Syndrome
ALu Artemether Lumefantrine
AMREF African Medical and Research Foundation
AR Attributable Risk
ASFR Age-Specific Fertility Rate
CBOs Community Based Organizations
CBR Crude Birth Rate
CDC Centers for Disease Control and Prevention
CFR Case Fatality Rate
CHD Coronary Heart Disease
CHMT Council Health Management Team
COTC Clinical Officers Training Centre
DALYs Disability Adjusted Life Years
df Degrees of Freedom
DL Distance Learning
EPI Expanded Program on Immunization
FELTP Field Epidemiology and Laboratory Training Programmes
GFR General Fertility Rate
GRR Gross Reproductive Rate
H1N1 Influenza A virus subtype H1N1
HIV Human Immunodeficiency Virus
IDS Integrated Disease Surveillance
IDSR Integrated Disease Surveillance and Response
IMCI Integrated Management of Childhood Illness
KAP Knowledge, Attitudes and Practices
MUCHS Muhimbili University College of Health Sciences
NNT Neonatal Tetanus
OC Oral Contraceptives
OR Odds Ratio
PTB Pulmonary Tuberculosis
PYLL Person Years of Life Lost
RD Risk Difference
RR Relative Risk
SBP Systolic Blood Pressure
SND Standard Normal Deviate
STIs Sexually Transmitted Infections
TDHS Tanzania Demographic and Health Survey
TFR Total Fertility Rate
TPHA Treponema pallidum Haemagglutination Assay
USA United States of America
VDRL Venereal Disease Research Laboratory
WHO World Health Organization
WHO/AFRO World Health Organization/African Region
Session 1: Introduction to Biostatistics and
Methods for Qualitative Data
Learning Objectives
By the end of this session, students are expected to be able to:
• Define terms used in biostatistics
• Explain the need for studying biostatistics in medical science
• List applications of biostatistics
• Explain descriptive statistics
• Describe descriptive methods for qualitative data
Introduction to Biostatistics
• Biostatistics can be defined as the application of statistics to biological problems.
• To many biomedical scientists, the term is considered to mean the application of statistics
specifically to medical problems.
• For this group of people, therefore, biostatistics and medical statistics are synonymous.
Other Terms Used in Biostatistics

• Statistic: A descriptive measure computed from the data of a sample. Statistics, as a field,
examines the ‘collection, organization, summarization, and analysis of data’ and
draws inferences about a population through observation of a sample.
• Data: The raw material of statistics. Data generally consist of numbers obtained by
measurement or counting.
o For example, a nurse may record the temperature of patients (a measurement) or count
the number of patients with a temperature above normal.
• Population: A collection of entities. A statistical population refers to the largest
collection of entities in which we have an interest.
o For example, we may be interested in looking at women of reproductive age who
have had one child. Therefore, our population is limited to only those women aged
15-45 who have one child.
• Sample: Part of a population.
o A sample of the example population of women 15-45 with one child might consist of
an estimated 25 percent of the population.
• Parameter: A descriptive measure computed from the data of a population.
Commonly Used Symbols in Statistics

μ: Population mean
σ: Population standard deviation
x̄: Sample mean
s: Sample standard deviation
x̃: Median
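The difference between a parameter and a statistic, and the symbols listed above, can be illustrated with a short Python sketch. The temperature values are hypothetical and invented only for illustration; the "population" is kept to eight patients so that both sets of measures can be computed.

```python
import statistics

# Hypothetical body temperatures (degrees Celsius) for a whole population of 8 patients.
population = [36.5, 37.0, 38.2, 36.8, 39.1, 37.4, 36.9, 38.5]

# A sample is part of the population; here every second patient is taken.
sample = population[::2]

# Parameters: descriptive measures computed from population data.
mu = statistics.mean(population)        # population mean
sigma = statistics.pstdev(population)   # population standard deviation

# Statistics: descriptive measures computed from sample data.
x_bar = statistics.mean(sample)         # sample mean
s = statistics.stdev(sample)            # sample standard deviation (divides by n - 1)
median = statistics.median(sample)      # sample median

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"x_bar = {x_bar:.2f}, s = {s:.2f}, median = {median:.2f}")
```

In practice the population is usually too large to measure in full, so only the sample statistics are available and they are used to estimate the parameters.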
Two Forms of Statistics

• First, ‘statistics’ as a noun is a plural for the word ‘statistic’ which simply means
numerical statements (i.e. information that is available in numbers). Examples of this
include:
o Hospital data on the number of admissions for some condition in a defined time
period
o How much of a drug (e.g. Artemether Lumefantrine [ALu]) is distributed to health
units, hospitals, health centers, dispensaries, etc.
• Secondly, statistics as a discipline is a field of study concerned in broad terms with:
o Collecting, organizing and summarizing data in a systematic way
o Drawing of inferences about a population on the basis of only a part of the population
targeted
Note: When referring to the discipline of statistics, the singular form of the word is not used
and has no meaning, just as the words ‘mathematic’ and ‘physic’ have no meaning
as singulars when referring to the disciplines of mathematics and physics.
• This course is mainly concerned with the second sense of the meaning of statistics, which
is statistics as a discipline.
• The introductory portion of the study of statistics is usually referred to as descriptive
statistics, and the second part is referred to as inferential statistics, which provides
objective means for drawing conclusions.
• The kind of biostatistics referred to in this course will be that of medical statistics.
Need for Biostatistics

• At first, it may not be clear why statistics should be taught in medical schools.
• But the element of variability in life is evidence of the need for a standardized technique
to cope with inevitable biological variability.
• In the physical sciences we often deal with constants, for example:
o The number of hydrogen atoms in any single molecule of water is always two.
o The velocity of electromagnetic waves in a given medium is always the same (e.g. the
speed of light in a vacuum is approximately 3 × 10⁸ m s⁻¹).
• But in the biological and biomedical sciences, no constituent or characteristic of living
organisms can be defined by a single value which is identical for all individuals.
• Consider, for example, the following general questions:
o What is the normal blood pressure in the human body?
o What is the amount of hemoglobin in blood?
• Answers to these questions suggest variations.
• Some specific questions to medical specialists could be as follows:
o Mr. Physician, what are the limits of error in your blood pressure measurements?
o Mr. Radiologist, what is the probability that your colleague’s reports on these X-ray
films would agree with yours?
o Mr. Pathologist, what proportions of your diagnoses are correct at post mortem?
• These questions suggest that answers need to be quantified in order to cope with the
situation.
• Therefore a numerical approach is needed. Biostatistical methods are a numerical approach
that can quantify data and also account for variation.
• We illustrate further the need for biostatistics in medical and health sciences generally
with the following (fictitious) example:
o A study to compare two treatments: new and standard, in which 400 patients (200
males and 200 females) were recruited, gave the following results (see Figure 1
below):
Figure 1: Results of Comparison Between Two Treatments
             Outcome
Treatment    Improved    Did not improve    Total    % improved
Standard     80          120                200      40
New          100         100                200      50
Total        180         220                400      45
• With these results one may be tempted to conclude that the new treatment is better than
the old (standard) treatment.
• An analysis that looks at the results for male patients separately from the female patients
revealed the following:
Figure 2: Results of Comparison Between Two Treatments Among Females
             Outcome
Treatment    Improved    Did not improve    Total    % improved
Standard     32          8                  40       80
New          96          64                 160      60
Total        128         72                 200      64
• From this table, we note that for female patients, the standard treatment shows more
improvement.
• This is exactly the opposite of what we saw in the overall assessment, so one might
expect the new treatment to fare better among the male patients.
• If this held, the conclusion would be:
o Female patients show more of an improved outcome with the standard treatment,
while male patients show more improvement with the new treatment.
o In practical terms, the decision following this controversial conclusion would be
understandable.
• When we look at the results relating to the male patients we see the following:
Figure 3: Results of Comparison Between Two Treatments Among Males
             Outcome
Treatment    Improved    Did not improve    Total    % improved
Standard     48          112                160      30
New          4           36                 40       10
Total        52          148                200      26
• Figure 3 above shows that just as in female patients, the standard treatment is also more
effective in male patients.
• The calculations can be checked against the overall results: for example, the overall rate
of improvement for the standard treatment is (32 + 48) / (40 + 160) = 80/200 = 40%, as
shown in Figure 1.
• With a proper statistical method of analysis, it becomes clear that the difference in
improvement between the two treatments when gender has been taken into account is
20 percentage points in favour of the standard treatment (80% versus 60% among females,
and 30% versus 10% among males).
• Such features are common in medical surveys and are a typical aspect of observational
studies.
• In an experimental study, the situation would have been controlled.
• These arguments emphasize the need for biostatistical methods for both data analysis and
study designs.
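The reversal seen in Figures 1 to 3 (often called Simpson's paradox) can be reproduced with a few lines of Python. This is only a sketch using the counts from the figures; the dictionary layout and the function name `percent_improved` are choices made for this illustration. The reversal arises because most females received the new treatment, most males received the standard treatment, and females improved more often than males whichever treatment they received.

```python
# Counts taken from Figures 2 and 3: (improved, total) per treatment within each sex.
strata = {
    "Females": {"Standard": (32, 40), "New": (96, 160)},
    "Males": {"Standard": (48, 160), "New": (4, 40)},
}


def percent_improved(improved, total):
    """Return the percentage of patients who improved."""
    return 100 * improved / total


# Sex-specific rates: the standard treatment does better in both sexes.
for sex, arms in strata.items():
    rates = {arm: percent_improved(*counts) for arm, counts in arms.items()}
    diff = rates["Standard"] - rates["New"]
    print(f"{sex}: standard {rates['Standard']:.0f}%, new {rates['New']:.0f}%, "
          f"difference {diff:.0f} percentage points")

# Pooled (crude) rates: adding the two sexes together reverses the picture.
overall = {}
for arm in ("Standard", "New"):
    improved = sum(strata[sex][arm][0] for sex in strata)
    total = sum(strata[sex][arm][1] for sex in strata)
    overall[arm] = percent_improved(improved, total)

print(f"Overall: standard {overall['Standard']:.0f}%, new {overall['New']:.0f}%")
```

The pooled figures match Figure 1 (40% versus 50%), while the sex-specific figures match Figures 2 and 3.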
Application of Biostatistics Methods

• Statistical methods have a role in:
o Official health statistics (statements), e.g. studying trends of number of cases of a
disease over time
o Epidemiology, e.g. association of diseases with some aetiological factors
o Clinical studies, e.g. comparison of treatments in clinical trials
o Human biology, e.g. growth pattern
o Laboratory studies e.g. dose-response studies
o Health service administration, e.g. with limited resources, there may be need to
prioritize target groups for necessary interventions
Introduction to Descriptive Statistics

• Numerical information needs to be summarized before it can be used.
• The methods of summarizing data (methods of descriptive statistics) vary with different
types of data that are generated from different types of variables.
• A variable must first be defined, and then different types of data are defined.
Definition of a Variable
• Variable: A term for a characteristic that is different in different members of a population
or sample, such as height.
o Such a characteristic is not constant; it varies, and is therefore called a variable.
o Variables can be qualitative or quantitative, continuous or discrete.
o Random variables cannot be predicted and are the most useful for statistical purposes.
Figure 4: Examples of Variables

Variable                Possible Values
Height (cm)             158, 169.3, 170, 200.6, etc.
Weight (kg)             10.2, 50, 69.4, 84, etc.
Parity                  0, 1, 6, 8, 10, etc.
Outcome of disease      Recovery, chronic illness, death
Marital status          Single, married, widowed, separated, cohabiting
Age (years)             1, 5, 30, 36, etc.
Hemoglobin (g/dl)       8.9, 14.2, 12.7, etc.
Number of AIDS cases    278, 301, 313, 350, etc.
Types of Variables
• There are two types of variables:
o Qualitative (categorical) variables
o Quantitative (numerical) variables
Qualitative (Categorical) Variables

• Qualitative variables do not take numerical values, for example:
o Gender (male, female)
o Outcome of disease (recovery, chronic illness, death)
o Hair color (black, blonde)
o Marital status (single, married, widowed, separated, divorced)
Quantitative (Numerical) Variables
• Quantitative variables take numerical values, for example:
o Age (years): 10, 19, 45, 60
o Height (cm): 140, 50.6, 200
o Parity: 0, 1, 2, 3, 4, 5, 6, 10
o Hemoglobin (g/dl): 16.3, 8.9, 12.7
• Quantitative variables are of two types: continuous and discrete
• Continuous variables take any value within meaningful extremes, for example:
o Height (cm): 159, 25, 160.35
o Weight (kg): 71.12, 80.56
o Exact age like 21 yrs 6 months and 4 days
• Discrete variables take only fixed values, in most cases whole numbers, for example:
o Parity: 0, 1, 2, 3, 4, 5, 6, 10
o Age at last birthday: 5, 19, 45, 90
o Counts: 1, 2, 3, 4, 5, 9
o Number of AIDS cases: 100, 10000, 34278
Levels of Measurement
• Variables are measured on different levels/scales
• The term ‘measurement’ is used here in a broad sense
• These are nominal, ordinal, interval and ratio measurements
Refer to Handout 1.1: Differences Between Nominal, Ordinal, Interval and Ratio
Measurements
Descriptive Methods for Qualitative Data
Frequency and Relative Frequency Distribution

• Frequency distribution: A presentation of the number of times (or the frequency) that
each value (or group of values) occurs in the study population.
o Frequency distribution helps to give a picture of the shape of the distribution of the
data.
• Unimodal data: Data that only has one peak.
• Bimodal data: Data that has two peaks.
• Multimodal data: Data that has more than two peaks.
o Measures of dispersion help to form a clearer picture of the distribution of the data by
describing the height, or the spread, of the data.
o A frequency distribution can be displayed as a table, a bar chart, a histogram, or a
frequency polygon.
o Each method should be clearly labeled with the frequency number.
o The method usually depends on the type of variable being described.
• Relative frequency distribution: The frequency of each value expressed as a proportion
(or percentage) of the total frequency of the variable.
• Cumulative relative frequency distribution: The running total of the relative
frequencies as the value of the variable increases.
Use of Tallies in Making Frequency Distribution
• A frequency distribution is normally formed (manually) by a process known as
tallying. This involves the following steps:
o Scan the data and determine the categories
o List the categories
o Work through the data and allocate each observation to the category where it
belongs using the tally marks to keep a count of the number in each category
o Add the tally marks to give the frequency
• The following data show a qualitative variable ‘Result of sputum examination’, coded as
follows:
o 1 = Smear negative (–ve), culture negative (–ve)
o 2 = Smear negative (–ve), culture not done
o 3 = Smear positive (+ve), culture positive (+ve)
1211311332131123113123113113131321131121123
1112122311213111112131131112111323331112111
From the above data:

Value                          Tally                Frequency
Smear -ve, culture -ve         IIII IIII … IIII     144
Smear -ve, culture not done    IIII IIII … IIII     40
Smear +ve, culture +ve         IIII IIII … IIII     45
Note: IIII indicates 5 observations
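The tallying steps can also be carried out by computer. The sketch below uses Python's `collections.Counter` on the two rows of coded results printed above. Those rows appear to be only an excerpt of the full data set, so the counts it produces are not expected to match the frequencies of 144, 40 and 45 in the table. The variable names are choices made for this illustration.

```python
from collections import Counter

# Codes for 'Result of sputum examination', as defined above.
labels = {
    "1": "Smear -ve, culture -ve",
    "2": "Smear -ve, culture not done",
    "3": "Smear +ve, culture +ve",
}

# The two rows of coded observations printed above (an excerpt only).
data = (
    "1211311332131123113123113113131321131121123"
    "1112122311213111112131131112111323331112111"
)

# Counter allocates each observation to its category and keeps the count.
frequency = Counter(data)

for code, label in labels.items():
    count = frequency[code]
    # Show tally marks in groups of five, mirroring the manual method.
    groups, remainder = divmod(count, 5)
    tally = " ".join(["|||||"] * groups + ["|" * remainder]).strip()
    print(f"{label:<28} {tally}  ({count})")

print(f"Total observations tallied: {sum(frequency.values())}")
```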
Figure 5: Frequency, Relative Frequency and Cumulative Relative Frequency for Sputum
Examination
Value                          Frequency    Relative         Cumulative Relative
                                            Frequency (%)    Frequency (%)
Smear -ve, culture -ve         144          62.9             62.9
Smear -ve, culture not done    40           17.5             80.4
Smear +ve, culture +ve         45           19.6             100.0
Total                          229          100.0            100.0
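The relative and cumulative relative frequencies in Figure 5 follow directly from the frequencies. The sketch below shows the arithmetic in Python, starting from the three frequencies in the table. Because it works from the unrounded counts, a value may differ in the last decimal place from the printed figure (for example, 45/229 is about 19.65%); the printed column appears to have been adjusted so that it totals exactly 100.0.

```python
# Frequencies from the tally of sputum examination results.
frequencies = {
    "Smear -ve, culture -ve": 144,
    "Smear -ve, culture not done": 40,
    "Smear +ve, culture +ve": 45,
}

total = sum(frequencies.values())

rows = []
cumulative_count = 0
for value, count in frequencies.items():
    cumulative_count += count
    relative = 100 * count / total                # relative frequency (%)
    cumulative = 100 * cumulative_count / total   # cumulative relative frequency (%)
    rows.append((value, count, relative, cumulative))

for value, count, relative, cumulative in rows:
    print(f"{value:<28} {count:>5} {relative:>6.1f} {cumulative:>6.1f}")
print(f"{'Total':<28} {total:>5} {100.0:>6.1f}")
```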
Use of Diagrams
• Frequency distributions can be illustrated visually by means of statistical diagrams.
• These diagrams serve two main purposes:
o Presentation of information/data (e.g. report) in articles for ease of appreciation
o To serve as a private aid for further statistical analysis
• Two types of diagrams are commonly used to illustrate qualitative data. These are:
o Pie charts
o Bar charts
Pie Charts
• These are used to express the distribution of individual observations into different
categories.
• Note that the frequencies should be converted into percentages totaling 100 for a pie chart
to be used.
• Example of a pie chart illustrating the distribution of students enrolled for academic year
2008/9 at Kilosa Clinical Officers Training Centre (Kilosa COTC): first year (A) = 57,
second year (B) = 44, third year (C) = 38, and Distance Learning (DL) = 12
Figure 6: The Numbers of Students at Kilosa COTC for Academic Year 2008/9
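Before a pie chart is drawn, the frequencies are converted into percentages. A minimal sketch of that conversion for the Kilosa COTC enrolment figures follows; the variable names are illustrative, and the sector angles (each share of 360 degrees) are an added convenience rather than something the manual specifies.

```python
# Enrolment figures quoted in the pie chart example.
enrolment = {
    "First year (A)": 57,
    "Second year (B)": 44,
    "Third year (C)": 38,
    "Distance Learning (DL)": 12,
}

total = sum(enrolment.values())

# Percentage of all students in each category (these total 100).
percentages = {group: 100 * count / total for group, count in enrolment.items()}

# Angle of each pie sector in degrees (these total 360).
angles = {group: 360 * count / total for group, count in enrolment.items()}

for group in enrolment:
    print(f"{group:25s} {percentages[group]:5.1f}% {angles[group]:6.1f} degrees")
```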
Bar Chart
• The bar chart is the simplest and most effective means of illustrating qualitative data
• The various categories of a variable are represented on the horizontal axis and the
frequency or relative frequency is represented on the vertical axis
• The length of each bar represents the number of observations (frequency) in each
category or the relative frequency in percentage
• For example, consider the following birth control method mix in a certain population:
Figure 7: Birth Control Method Use in a Certain Population

Birth control method Percent
Abstinence 3
Oral contraceptive 32
Depo-Provera 9
Loop 17
Spermicides 7
Condoms 26
Vasectomy 3
Hysterectomy 2
Norplant 1
Total 100
Note: A pie chart would not be suitable for this variable because the diagram would be
too congested. Hence a bar chart is more appropriate (see Figure 8 below).
Figure 8: Bar Chart for Percentage of Birth Control Method Use
[Horizontal bar chart. Vertical axis: birth control method. Horizontal axis: percentage of utilization, 0 to 35.]
Two-Way Tables
• Statistical information on two variables can be presented simultaneously in a form of a
two-way table.
• This table makes the information easier to assimilate by showing many of the properties
of the data at a glance.
• In a two-by-two table, data are presented in rows and columns.
• The format for a table depends upon the data and the aspects of the data which are
important to portray.
• A two-way table should include the following:
o A clear title
o A caption for the rows and columns with units of measurement of the variable
o Labels for each individual row or column, i.e. the values taken by the variable
concerned
o Marginal and grand totals
Activity: Demonstration
Instructions
Using the scenario below, the tutor will show how to create a two-way table, as shown in Figure 9.
Scenario
In a study to investigate whether or not HIV infection is a risk factor for pulmonary
tuberculosis (PTB), a total of 2165 individuals were examined. Blood samples were also
collected from these individuals for laboratory diagnosis of HIV infection. Of the 2165
individuals examined, 651 were found to be negative for HIV infection. Of those who were
negative, 57 were found to have PTB. Of the 1514 who were HIV positive, 875 were found to
have PTB. This information can be summarized in a two-by-two table as shown in Figure 9 below.
Figure 9: Pulmonary Tuberculosis Infection by HIV Status
HIV status    PTB Positive     PTB Negative     Total
Positive      875 (57.8%)      639 (42.2%)      1514 (100.0%)
Negative      57 (8.8%)        594 (91.2%)      651 (100.0%)
Total         932 (43.0%)      1233 (57.0%)     2165 (100.0%)
Note: Numbers in brackets show the row percentages.
• The cells of a two way table may contain percentages instead of the real counts.
• Calculation of percentages may be row-wise or column-wise depending on the purpose of
the table.
• In the above table, the interest is to investigate whether HIV infection is a risk factor for
PTB.
o The aim is to see whether PTB is higher in HIV positives than in HIV negatives.
o The row percentages are more appropriate in this case.
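The row percentages in a two-way table can be computed as sketched below, using the counts from the HIV and PTB scenario. The nested-dictionary layout and the function name `row_percentages` are assumptions made for this illustration only.

```python
# Counts from the scenario: rows are HIV status, columns are PTB status.
counts = {
    "HIV positive": {"PTB positive": 875, "PTB negative": 639},
    "HIV negative": {"PTB positive": 57, "PTB negative": 594},
}


def row_percentages(table):
    """Express each cell as a percentage of its own row total."""
    result = {}
    for row_label, cells in table.items():
        row_total = sum(cells.values())
        result[row_label] = {
            column: 100 * count / row_total for column, count in cells.items()
        }
    return result


percent = row_percentages(counts)
grand_total = sum(sum(cells.values()) for cells in counts.values())

for row_label, cells in percent.items():
    shown = ", ".join(f"{column}: {value:.1f}%" for column, value in cells.items())
    print(f"{row_label}: {shown}")
print("Grand total:", grand_total)
```

Comparing the "PTB positive" percentage between the two rows answers the question of interest directly, which is why row percentages suit this table.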
Key Points
• The term biostatistics means the application of statistics to biological and health problems.
• Biostatistics needs to be studied in medical science because standardized techniques are
required to cope with the inevitable biological variability.
• Biostatistical methods have a role to play in official health statistics, epidemiology,
clinical studies, human biology, laboratory studies and health service administration, and
in prioritizing target groups for necessary interventions.
• Descriptive statistics, also known as descriptive methods, vary with the type of data,
which in turn depends on the type of variable.
• A frequency distribution is a descriptive method for qualitative data.
• This means that the number of times (or the frequency) that each value (or group of
values) occurs in the study population is tallied and summarized by using a variety of
methods (pie charts, bar charts, etc.) depending on the type of data and purpose.
Evaluation
• What does the term biostatistics mean?
• Why do we need to study biostatistics in medical science?
• What are the applications of biostatistics?
• What is descriptive statistics?
• What are the descriptive methods for qualitative data?
References
• Bonita R. et al. (2006). Basic Epidemiology (2nd ed.). Geneva, Switzerland: WHO.
• Jones D. et al. (2008). Biostatistics. Work Book-Field Epidemiology and Laboratory
Training Programmes (FELTP).
• McCusker J. (2001). Epidemiology in Community Health, Rural Health Series No. 9
(Revised Edition). Nairobi, Kenya: AMREF.
• Rosner B. (2006). Fundamentals of Biostatistics (6th ed.). Australia, Canada, Singapore,
Spain, United Kingdom, United States: Thomson Brooks/Cole.
• Varkevisser et al. (1995). Designing and Conducting Health Systems Research Projects,
Volume 2 Part 2 Module 24. Health Systems Research Training Series.
Handout 1.1: Differences Between Nominal, Ordinal, Interval
and Ratio Measurements
Variables are measured on different levels/scales. The term ‘measurement’ is used here in a
broad sense.
Nominal Measurement
• The nominal scale classifies persons or things based on a qualitative assessment of the
characteristic being assessed. It neither includes information on quantity or amount nor
does it indicate ‘more than’ or ‘less than’.
o Example 1: Gender (male or female) is a common nominal variable used in
epidemiologic studies.
o Example 2: Country telephone codes are an example of numeric variables that do not
indicate more or less (country code 82 is not more than country code 37).
o Other examples: Numeric codes used to identify the categories that make up a given
variable, e.g. Religion: 1 = Muslim, 2 = Christian, 3 = Other. Note that the numeric
codes do not signify ranking, and that the categories of a nominal variable cannot
occur together and are not related.
Ordinal Measurement
• The ordinal scale also classifies persons or things based on the characteristic being
assessed but does indicate ‘more than’ or ‘less than’. In this sense, it provides more
information than the nominal scale. However, the ordinal scale does not indicate how
much more than or less than.
o Example: Rating students’ performance as being poor, average, good, or excellent
indicates how well students perform and provides a basis for comparison. However,
it does not indicate how much better an excellent performance is compared to a good
one.
Interval Measurement
• The interval scale has the same characteristics of the ordinal scale – classifying persons or
things based on the characteristic assessed and indicating more than or less than – but the
interval scale indicates how much more than or less than.
• The interval scale does not have a true zero point, meaning that a value of zero does not
indicate the absence of the characteristic being measured. Additionally, ratios of two
values on the interval scale do not have meaning.
o Example: Temperature (in degrees Celsius or Fahrenheit) is an interval measurement in
that different values can tell you how much more or less. However, there is no true zero
point. The value of zero in temperature does not indicate absence of temperature. Also,
when comparing two temperatures, their ratio is not meaningful. We would not say that
a 90 degree temperature is twice as hot as a 45 degree temperature.
Ratio Measurement
• The ratio scale includes all the characteristics of the interval scale but does indicate a true
zero point.
o Example: Height and weight measurements indicate how much more or less, but also
have a true zero point. A weight of zero indicates an absence of weight.
Differences Between Nominal, Ordinal, Interval and Ratio Measurements
• Nominal
o Classifies persons or things based on a qualitative assessment
o Similar or dissimilar but not more or less
o Can be numeric but there is no implication of more or less
• Ordinal
o Classifies persons or things based on a qualitative assessment
o More or less but not how much more or less
• Interval
o Indicates how much more or less
o Does not contain a true zero point
o Cannot create meaningful ratios of two values
• Ratio
o Includes all the characteristics of the interval scale, but contains a true zero point
Commonly Used Symbols in Statistics

• μ: population mean
• x̄: Sample mean
• x̃: median
• σ: population standard deviation
• s: sample standard deviation
Session 2: Descriptive Methods for Quantitative
Data
Learning Objectives
• Describe the descriptive methods of quantitative data
• Describe the different methods of presenting frequency distribution data for grouped and
ungrouped data
• Describe the difference between the mean, median and mode
• Calculate the mean, median, variance and standard deviation
Descriptive Methods for Quantitative Data

• Frequency distributions are also used to summarize quantitative data.
• A frequency distribution for quantitative data can be used to summarize ungrouped or
grouped data.
• For discrete variables, the frequency may be tabulated for each value (See Figure 1).
Frequency Distribution for Ungrouped Data

Figure 1: Distribution of Number of Counts of Trypanosome in the Blood of a Rat’s Tail
Count Frequency Relative Frequency (%) Cumulative Relative Frequency (%)
0 4 3.1 3.1
1 27 21.1 24.2
2 27 21.1 45.3
3 20 15.6 60.9
4 16 12.5 73.4
5 17 13.3 86.7
6 12 9.4 96.1
7 2 1.6 97.7
8 1 0.8 98.5
9 2 1.6 100
Total 128 100.0
Frequency Distribution for Grouped Data

• When dealing with a continuous variable or a discrete variable with a wide range of
possible values, a summary frequency table is produced.
• Summary frequency table is formed by distributing the data into classes or groups, and
determining the number of observations belonging to each class.
• Figure 2 below shows an example of a frequency distribution table for grouped data.
Figure 2: Frequency Distribution of No. of Lesions caused by Smallpox Virus in an Egg
Membrane
No. of Lesions   Frequency of No. of Membranes (f)   Class Mid-Point (x)   (fx)
0 -              1                                   5                     5
10 -             6                                   15                    90
20 -             14                                  25                    350
30 -             14                                  35                    490
40 -             17                                  45                    765
50 -             8                                   55                    440
60 -             9                                   65                    585
70 -             3                                   75                    225
80 -             6                                   85                    510
90 -             1                                   95                    95
100 -            0                                   105                   0
110 -            1                                   115                   115
Total            80                                                        3670
Note that the dash symbol (-) means ‘up to but not including’ the next tabulated value. (That is,
according to the table in Figure 2, 10- means 10 is the lower limit while 19 is the upper limit. The
value 15 is therefore the midpoint for the class interval 10-.)
The Rules That are Used to Make a Frequency Distribution for Grouped Data
• Determine the Range, R, of values. (R=largest value-smallest value)
• Decide on the number, I, of classes.
o This number depends on the form of data and the requirements of the frequency
distribution, but usually they should be between 5 and 20 for convenience.
• Determine the width of the class interval, W, such that W=R/I.
o A constant width for all classes is preferable.
• Choose the upper and lower limits of the class interval carefully to avoid ambiguities.
• List the intervals in order.
• Use tallies to allocate each observation into the class in which it falls.
• Add the tally marks to obtain class frequencies.
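The rules above can be sketched as a short Python function. This is a minimal illustration under stated assumptions: the function name `grouped_frequency` is hypothetical, the class width W = R/I is rounded up to a whole number (a choice made here, not a rule from the manual), and the data are the ten heights used in the arithmetic mean example in this session. With only ten values, four classes are used instead of the usual 5 to 20.

```python
import math
from collections import Counter


def grouped_frequency(values, num_classes):
    """Return (lower limit, next lower limit, frequency) for each class.

    Each class runs from its lower limit 'up to but not including'
    the lower limit of the next class.
    """
    smallest, largest = min(values), max(values)
    data_range = largest - smallest                      # R = largest - smallest
    width = max(1, math.ceil(data_range / num_classes))  # W = R / I, rounded up
    # Tally step: allocate each observation to the class in which it falls.
    tallies = Counter((value - smallest) // width for value in values)
    classes_needed = data_range // width + 1
    table = []
    for k in range(classes_needed):
        lower = smallest + k * width
        table.append((lower, lower + width, tallies.get(k, 0)))
    return table


heights_cm = [165, 167, 169, 169, 171, 173, 175, 176, 176, 169]
table = grouped_frequency(heights_cm, num_classes=4)
for lower, upper, frequency in table:
    print(f"{lower} - (below {upper}): {frequency}")
```

Because the width is rounded up, the number of classes produced can be smaller than the number requested; choosing limits by hand, as the rules suggest, gives more control.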
Methods of Presenting Different Data

Histograms
• A histogram is a familiar bar-type diagram.
• The value of the variable is represented on a horizontal scale and the vertical scale
represents the frequency or relative frequency at each value.
• Each bar centers at the midpoint of the class.
• If the frequency distribution consists of class intervals that are not equal, it is necessary
to calculate the average frequency per standard interval.
o See the example below in Figure 3 for how to develop a histogram from the data.
Figure 3: Frequency Distribution of Age at Loss of Last Tooth
Age (years)   Frequency   Interval Width (years)   Average Number per Year of Age at Loss of Last Tooth
11 – 15       1           5                        0.20
16 – 19       7           4                        1.75
20 – 24       21          5                        4.20
25 – 29       35          5                        7.00
30 – 34       40          5                        8.00
35 – 44       58          10                       5.80
45 – 54       28          10                       2.80
55 – 74       10          20                       0.50
Total         200
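The last column of Figure 3 is the frequency divided by the interval width, which gives the bar height to plot when class intervals are unequal. A minimal sketch of that calculation, with the class limits and frequencies copied from Figure 3 and variable names chosen for illustration:

```python
# (lower age, upper age, frequency) for each class in Figure 3.
classes = [
    (11, 15, 1),
    (16, 19, 7),
    (20, 24, 21),
    (25, 29, 35),
    (30, 34, 40),
    (35, 44, 58),
    (45, 54, 28),
    (55, 74, 10),
]

bar_heights = []
for lower, upper, frequency in classes:
    width = upper - lower + 1        # both limits are included in the class
    per_year = frequency / width     # average frequency per year of age
    bar_heights.append(round(per_year, 2))
    print(f"{lower} - {upper}: width {width:2d}, height {per_year:.2f}")
```

Plotting these heights, rather than the raw frequencies, keeps the area of each bar proportional to the number of observations in its class.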
The data in Figure 3 are presented as a histogram in Figure 4 below:
Figure 4: Histogram Showing Distribution of Age at Loss of Last Tooth
[Histogram. Horizontal axis: age at loss of last tooth.]
Line Diagrams
• Line diagrams are often used to express the change in some quantity over a period of time
or to illustrate the relationship between continuous quantities.
• Each point on the graph represents a pair of values, i.e. a value on the x-axis and a
corresponding value on the y-axis.
• Straight lines then connect the adjacent points.
Figure 5: Line Diagram for Cumulative Number of AIDS Cases in Tanzania 1983 to 1992
[Line diagram. Horizontal axis: year, 1983 to 1992.]
Frequency Polygons
• Frequency polygons are a series of points (located at the mid-point of the interval)
connected by straight lines.
• The height of these points is equal to the frequency or relative frequency associated with
the values of the variable (or the interval).
• The end points are joined to the horizontal axis at the mid points of the groups
immediately below and above the lowest and highest non-zero frequencies, respectively.
• Frequency polygons are not as popular as histograms, but are also a visual presentation of
a frequency distribution.
• They can easily be superimposed and are therefore superior to histograms for comparing
sets of data.
• Figure 6 below shows an example of a frequency polygon.
Figure 6: Frequency Polygon for the Number of Trypanosomes in the Blood of a Rat’s Tail
[Frequency polygon. Horizontal axis: counts. Vertical axis: frequency.]
Cumulative Frequency Curve

• This is similar to a frequency polygon, but the vertical axis displays cumulative relative
frequency and the point is placed at the upper limit of the interval.
Figure 7: Cumulative Frequency Curve for the Number of Trypanosomes in the Blood of a
Rat’s Tail
[Cumulative frequency curve. Horizontal axis: counts. Vertical axis: frequency.]
• When making a statistical diagram, the axes should be clearly labeled and units of
measurement indicated.
• The choice of scales should be made with care.
Measures of Location or Central Tendency

• The measures of location give the overall magnitude of the values observed for each
variable.
• The three commonest measures of location are the arithmetic mean, the median and the
mode.
The Arithmetic Mean

• The mean is simply the arithmetic average of the data and is calculated by taking the sum
of all values in the number set and dividing that total by the number of values in the
dataset.
• The mean is the most commonly used measure of central tendency.
$\bar{x} = \frac{\sum x}{n}$
• For example: consider the following heights of 10 men in centimeters (cm): 165, 167,
169, 169, 171, 173, 175, 176, 176, 169
• The mean height is calculated by adding the heights for the ten men and dividing the sum
by 10.
$\text{Arithmetic mean} = \frac{165 + 167 + 169 + 169 + 171 + 173 + 175 + 176 + 176 + 169}{10}$
$\bar{x} = \frac{1710}{10} = 171 \text{ cm}$
o The arithmetic mean is denoted by $\bar{x}$
o Generally: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
o Where,
∑ = sum of all the values of the variable x from i=1 to i=n
n = number of observations
• The arithmetic mean can also be calculated from frequency distributions
• Refer to the data in Figure 1: multiply each value of the variable by its frequency
• Add them up and divide by the total frequency, for example:
$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$
• Where xi stands for the value of the variable and fi stands for the frequency of value xi
• For example: the mean count of trypanosomes in the tail blood of a rat is given by:
$\bar{x} = \frac{(0 \times 4) + (1 \times 27) + (2 \times 27) + \dots + (7 \times 2) + (8 \times 1) + (9 \times 2)}{128}$
$\bar{x} = \frac{402}{128} = 3.1$
• With grouped data, the class midpoint should be used when calculating the mean.
Consider the data in Figure 2: the mean number of lesions caused by smallpox virus in egg
membranes is:
$\bar{x} = \frac{(5 \times 1) + (15 \times 6) + (25 \times 14) + \dots + (95 \times 1) + (105 \times 0) + (115 \times 1)}{80}$
$\bar{x} = \frac{3670}{80} = 45.9$
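The three mean calculations above (raw values, a frequency distribution, and grouped data using class midpoints) can be checked with a short sketch. The helper name `weighted_mean` is hypothetical; the numbers are copied from the heights example and from Figures 1 and 2 of this session.

```python
def weighted_mean(values, frequencies):
    """Mean from a frequency distribution: sum of x*f divided by sum of f."""
    total_fx = sum(x * f for x, f in zip(values, frequencies))
    return total_fx / sum(frequencies)


# 1. Raw data: heights of 10 men in cm.
heights_cm = [165, 167, 169, 169, 171, 173, 175, 176, 176, 169]
mean_height = sum(heights_cm) / len(heights_cm)

# 2. Ungrouped frequency distribution: trypanosome counts (Figure 1).
counts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
count_freq = [4, 27, 27, 20, 16, 17, 12, 2, 1, 2]
mean_count = weighted_mean(counts, count_freq)

# 3. Grouped data: class midpoints and frequencies for lesions (Figure 2).
midpoints = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105, 115]
lesion_freq = [1, 6, 14, 14, 17, 8, 9, 3, 6, 1, 0, 1]
mean_lesions = weighted_mean(midpoints, lesion_freq)

print(mean_height, round(mean_count, 1), round(mean_lesions, 1))
```

The grouped mean is an approximation, since every observation in a class is treated as if it were equal to the class midpoint.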
• The arithmetic mean is a preferred measure since it uses more information from each
observation.
• However, it tends to be pulled towards extreme values.
• The following is the duration of stay in hospital (in days) for some condition:
5, 5, 5, 7, 10, 20, 102.
• The mean duration of stay is calculated as follows:
$\bar{x} = \frac{154}{7} = 22 \text{ days}$
• This does not reflect the typical duration of stay, because six of the seven patients stayed
20 days or fewer.
The Median
• The median is the 50th percentile of the values in a dataset and represents the literal
middle of the data.
• The median is found by arranging all values in the dataset in numerical order and then
choosing the middle value.
• If the number of values in a dataset is even, take the mean of the two middle numbers to
find the median.
• For example, below is a series of durations (in days) of absence from classes due to
sickness: 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80.
o The median duration is 5 days.
o Generally, when ‘n’ (number of observations) is odd, the median is the $\frac{1}{2}(n+1)$th
observation.
o But when ‘n’ is even, there is no middle observation, and the median is the mean of
the two middle observations, i.e. the $\frac{n}{2}$th and $(\frac{n}{2}+1)$th observations.
• In frequency distributions, the median can be obtained by accumulating the frequencies
and noting the value of the variable which divides the data into two equal halves (i.e. the
observation below which n/2 of the observations lie).
o The median is less efficient than the mean because it takes no account of the
magnitude of most of the observations.
o If two groups of observations are pooled, the median of the combined group cannot be
expressed in terms of the medians of the two component groups.
o The median is much less amenable than the mean to mathematical treatments so it is
less used in more elaborate statistical techniques.
• However, if the data are distributed asymmetrically the median is more stable than the
mean.
o For example: Drawing from the data on the duration of stay in the hospital, the
median is 7 which is a more realistic estimate than the calculated mean of 22 days.
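The position rule for the median, and the contrast between the mean and the median for the hospital stay data, can be sketched as follows. The function `median_by_position` is a hypothetical helper written for this illustration; Python's standard `statistics.median` gives the same results.

```python
import statistics


def median_by_position(values):
    """Median by the position rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th observation, counting from 1.
        return ordered[middle]
    # Even n: mean of the (n/2)-th and (n/2 + 1)-th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


absence_days = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 7, 8, 10, 10, 38, 80]
hospital_stay_days = [5, 5, 5, 7, 10, 20, 102]

absence_median = median_by_position(absence_days)
stay_median = median_by_position(hospital_stay_days)
stay_mean = statistics.mean(hospital_stay_days)

print("Median absence:", absence_median)
print("Hospital stay: mean", stay_mean, "median", stay_median)
```

The single stay of 102 days pulls the mean well above most of the values, while the median is unaffected by how large that extreme value is.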
The Mode
• The mode represents the value that is found most frequently in a set of numbers, though it
is not often used.
• Note that it is possible to have more than one mode.
o For example: in the following set of numbers (8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7) the mode is
both 8 and 6, since each is included in the dataset three times.
• This dataset is referred to as bimodal because it has two modes.
• It is also possible not to have a mode in a set of numbers.
o For example: in the following set of numbers (5, 4, 9, 7, 6, 3, 8) there is no number
which occurs more frequently than any other, therefore, there is no mode.
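Finding the mode amounts to counting how often each value occurs. A minimal sketch follows; the function name `modes` is hypothetical, and returning an empty list when no value repeats is a convention adopted here to match the statement that such a dataset has no mode.

```python
from collections import Counter


def modes(values):
    """Return all values sharing the highest frequency, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


bimodal_set = [8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7]
no_mode_set = [5, 4, 9, 7, 6, 3, 8]

print(modes(bimodal_set))
print(modes(no_mode_set))
```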
Comparison of Mean, Median and Mode

• When we speak of averaging data, it is generally implied that we take the ‘mean’ of the data.
• Technically, however, the ‘average’ could refer to the mean, the median, or the mode of
the data.
• The mean gives the most information about the dataset as a whole, especially when paired
with the standard deviation.
• Therefore, it is preferred to use the mean when possible.
• There are certain advantages to the median, such as:
o The median is resistant to skewing, which is the result of an outlier causing the mean
of the data to shift either to the left or to the right.
o It is not affected by extreme values like the mean is, and it is more representative of
the center of data when data is asymmetrical.
• Consider an example of skewed data: Look at the graph of the population distribution by
state in the United States in Figure 8 below.
Figure 8: Population of the United States by State
[Bar chart. Horizontal axis: state. Vertical axis: population, 0 to 40,000,000. The mean and the median are marked on the chart.]
Source: Bonita R. et al, 2006
• The states on the left side of the bar chart have a much larger population than the other
states.
• Because of this, we expect the mean to be higher in value than the median.
• The calculated mean in this sample is 5,811,968.706, which is marked on the graph
above in Figure 8.
• The median is 4,173,405, also marked on the graph: the mean in this example is greater
than the median.
• A general rule to follow is that if the data is skewed either to the left or to the right, the
median represents the data better than the mean.
• If a sample is normally distributed, the mean and median will be nearly the same.
• With symmetrical data, the mode will be similar as well.
• The mode is rarely used as it can easily be misinterpreted and is not used in statistical
tests.
• When the sample size is small, however, the mode may represent the data most
accurately.
• It is possible that in bimodal data, the modes will be a more accurate description as well.
• The mode is also frequently used to describe qualitative data.
• For example, you might find a modal diagnosis, or use the mode to describe medical
diagnoses by stating the diagnosis that was seen most frequently over a given period of
time.
Measures of Variability
• Measures of variability express the degree of variation or scatter of a series of
observations.
• Common measures of variation are range, variance and standard deviation.
The Range
• Is defined as the difference between the maximum value and the minimum value.
o For example: if the lowest and highest of a series of diastolic blood pressure readings
are 65 mm Hg and 95 mm Hg, then the range = 95 – 65 = 30 mm Hg.
• The range is seldom used in statistical analysis because:
o It wastes information since it uses information from only two extreme values.
o The two extreme values are more likely to be faulty.
o The range increases with increasing number of observations.
Variance and Standard Deviation

• The variance represents the amount of spread or variability around the mean of a set of
data.
• Because the variance is in units squared, we find the standard deviation to describe our
data in the proper units.
• The symbol s² is used when we are referring to the variance of a sample and the symbol
σ² when we are referring to the variance of a population.
• We will almost never know the variance of a population unless we are given a proportion.
The sample variance is calculated as:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ or $s^2 = \frac{n \sum x_i^2 - (\sum x_i)^2}{n(n-1)}$
o The variance is a measure of variability that makes use of the difference between each
observation and the mean (x̄), i.e. (xi – x̄).
o If all the differences were added together, and their mean calculated, this might be
expected to give an indication of the overall variability of the observations.
o But ∑(xi – x̄) is always zero since some differences are positive while some are
negative.
o Because of this, the differences are squared:
$\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n}$
o The variance is the mean value of the squared deviations from the mean. The numerator,
∑(xi – x̄)², is called the sum of squares about the mean.
• Since these differences are squared, the variance is measured in the square of the units in
which the variable X is measured.
o For example, if X is height in cm, the variance will be in cm2.
• A measure of variation that is measured in the original units of the variable is the standard
deviation, which is the square root of the variance:
$SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$
o The standard deviation shows the average deviation of observations from the mean,
and the interval x̄ ± 2SD covers roughly 95% of all the observations (for data that are
approximately normally distributed).
o The population variance is in most cases unknown because data are normally not
available for the whole population.
o When this is the case, the population variance, σ², is estimated by the sample variance,
s²:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
• Note a change in the denominator from n to n–1.
• When n–1 is used in the denominator, it gives a better estimate of the population variance
than when n is used.
• To calculate the variance and standard deviation for the eight observations in Figure 9
below, (xi – x̄) and (xi – x̄)² have to be calculated, and then ∑(xi – x̄)².
Figure 9: Example of Calculating Mean, Variance and Standard Deviation
Data: (1) 23 (2) 33 (3) 37 (4) 45 (5) 46 (6) 52 (7) 52 (8) 60
Step 1. Find the mean of the dataset.
$\bar{x} = \frac{23 + 33 + 37 + 45 + 46 + 52 + 52 + 60}{8} = \frac{348}{8} = 43.5$
Step 2. Calculate the variance using the formula $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$.
$s^2 = \frac{1}{8-1} \times [(23 - 43.5)^2 + (33 - 43.5)^2 + (37 - 43.5)^2 + (45 - 43.5)^2 + (46 - 43.5)^2 + 2(52 - 43.5)^2 + (60 - 43.5)^2]$
$s^2 = \frac{1}{7} \times [420.25 + 110.25 + 42.25 + 2.25 + 6.25 + 2(72.25) + 272.25]$
$s^2 = \frac{1}{7} \times 998$
$s^2 = 142.57$
Step 3. Calculate the standard deviation using the formula $s = \sqrt{s^2}$.
$s = \sqrt{142.57}$
$s = 11.94$
Note: Variance and standard deviation can be calculated using the shortcut formula for
∑(xi – x̄)².
This is:
$\sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$
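As a check on the two routes to the sum of squares, the sketch below computes it for the Figure 9 data both from the definition and from the shortcut formula, then derives the sample variance and standard deviation. The variable names are illustrative; `statistics.variance` from the Python standard library is used only as an independent comparison.

```python
import math
import statistics

data = [23, 33, 37, 45, 46, 52, 52, 60]
n = len(data)
mean = sum(data) / n

# Definition: sum of squared deviations from the mean.
ss_definition = sum((x - mean) ** 2 for x in data)

# Shortcut: sum of x squared, minus (sum of x) squared divided by n.
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

variance = ss_definition / (n - 1)   # sample variance, denominator n - 1
std_dev = math.sqrt(variance)        # standard deviation, in the original units

print(mean, ss_definition, ss_shortcut)
print(round(variance, 2), round(std_dev, 2))
```

The shortcut avoids computing each deviation separately, which is convenient when working by hand or with a simple calculator.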
Activity: Small Group Exercise
Instructions
You will work in small groups to calculate the variance and standard deviation by the
shortcut formula. Use the data from the table below. One group will report their
experience of the calculation and the answers they came up with, and the others will
share in the discussion.
sn 1 2 3 4 5 6 7 8
Value 24 34 38 46 47 53 53 61
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1}$
Key Points
• Histograms, line diagrams, frequency polygons and cumulative frequency curves are the
most common methods used to present both grouped and ungrouped data.
• Mean, median and mode are the most common measures of central tendency, while
standard deviation, range and variance are common measures of dispersion.
Evaluation
• What are descriptive methods of quantitative data?
• What are the different methods of presenting frequency distribution data for grouped and
ungrouped data?
• How can you calculate the mean, median, variance and standard deviation?
References
• Bonita R. et al. (2006). Basic Epidemiology (2nd ed.). Geneva, Switzerland: WHO.
• Jones D. et al. (2008). Biostatistics. Work Book-Field Epidemiology and Laboratory
Training Programmes (FELTP).
• McCusker J. (2001). Epidemiology in Community Health, Rural Health Series No. 9
(Revised Edition). Nairobi, Kenya: AMREF.
• Rosner B. (2006). Fundamentals of Biostatistics (6th ed.). Australia, Canada, Singapore,
Spain, United Kingdom, United States: Thomson Brooks/Cole.
• Varkevisser et al. (1995). Designing and Conducting Health Systems Research Projects,
Volume 2 Part 2 Module 24. Health Systems Research Training Series.
Session 3: The Normal Distribution
Learning Objectives
• Describe the normal distribution curve and its characteristics
• Explain probability distribution and continuous probability distribution
• Demonstrate skills to calculate standard normal distribution (SND)
Introduction to Probability Distributions

• Probability distributions represent the likelihood of the occurrence of different outcomes
(e.g. male, female) for a sample selection.
• The relationship between the values of a variable and the probabilities of their occurrence
can be summarized in a probability distribution.
• If we select a single worker from a workforce in which 60% of workers are male and 40%
are female, the probability distribution for the possible outcomes for gender is simple.
Possible Outcome Probability
Male 0.60
Female 0.40
• If we select three workers then the probability distribution becomes more complicated.
Possible Outcomes Probability
All male 0.216 = (0.60 x 0.60 x 0.60)
2 male, 1 female 0.432 = 3 x (0.60 x 0.60 x 0.40)
2 female, 1 male 0.288 = 3 x (0.40 x 0.40 x 0.60)
All female 0.064 = (0.40 x 0.40 x 0.40)
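The probabilities for three workers can be reproduced with the binomial formula, which multiplies the probability of one particular ordering by the number of orderings that give the same outcome (for example, the single female can be the first, second or third worker selected). The sketch below assumes the selections are independent; the function name `males_distribution` is hypothetical.

```python
from math import comb


def males_distribution(n_workers, p_male):
    """Probability of k males among n_workers, for k = 0 .. n_workers."""
    p_female = 1 - p_male
    return {
        k: comb(n_workers, k) * p_male**k * p_female ** (n_workers - k)
        for k in range(n_workers + 1)
    }


distribution = males_distribution(3, 0.60)
for k, probability in sorted(distribution.items(), reverse=True):
    print(f"{k} male, {3 - k} female: {probability:.3f}")
print("Sum of probabilities:", round(sum(distribution.values()), 3))
```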
Requirements for a Probability Distribution

• For every individual value of X: 0 ≤ P(X) ≤ 1
• Summed over all possible values of X: Σ P(X) = 1
Continuous Probability Distribution

• Different random variables have different probability distributions, but the one which will
be discussed in this session is the Normal or Gaussian distribution.
• This is the most important continuous probability distribution.
Characteristics of the Normal Distribution Curve

• The normal distribution is a bell-shaped curve with both the mean and the median at the
center of the curve.
• The standard normal distribution is a distribution of data with a mean of zero and a
standard deviation of 1. It allows different populations to be compared to each other.
• The formula below is used to calculate the standard score, or the ‘z score’, when
comparing normally distributed populations:
$z = \frac{x - \mu}{\sigma}$
where x = measurement, μ = arithmetic mean, σ = standard deviation
Characteristics of a Normal Distribution Curve
• It is specified by two parameters: the population mean and the standard deviation.
• It is symmetrical around the mean, bell-shaped, and unimodal. This is why the normal
curve is frequently referred to as the ‘bell curve’.
• The mean, median, and mode are all in the middle of the curve.
• The total area under the curve above the x-axis is one square unit with 50% of the area to
the right of the mean and 50% to the left of the mean.
• The area bounded by one standard deviation to the right and one standard deviation to the
left of the mean represents approximately 68% of the values.
• The area bounded by two standard deviations to the right and two to the left
represents approximately 95% of the values.
• The area bounded by three standard deviations to the right and three to the left
represents approximately 99.7% of the values (i.e. 99.7% of the values will be within
three standard deviations of the mean).
Figure 1: Areas Under the Normal Curve that Lie Between 1, 2 and 3 Standard Deviations on
Each Side of the Mean
Source: Jones D. et al., 2008
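The 68%, 95% and 99.7% areas can be verified numerically with the `NormalDist` class from Python's standard `statistics` module (available from Python 3.8). This is only a check on the figures quoted above; the helper name `area_within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the curve between k standard deviations below and above the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"Within {k} SD of the mean: {100 * area:.1f}%")
```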
• Knowing the mean and standard deviation of a normal distribution allows one to
determine the following values:
o The proportion of individuals who fall into any range of values
o The percentile at which a given value falls
o The value which corresponds to a given percentile
Applications of the Standard Normal Distribution
Activity: Exercise 1
Instruction
The tutor will provide an example of calculating the SND. Follow along with the
calculations.
A study of blood pressure of African American school boys gave a distribution of systolic
blood pressure (SBP) close to the normal with µ = 105.8mm Hg and σ = 13.4mm Hg.
• What percentage of boys would be expected to have SBP greater than 120 mm Hg?
• Calculate $SND = \frac{120 - 105.8}{13.4} = 1.06$
• From the table of Standard Normal Distribution, the area to the right of SND = 1.06 is
0.14457, so about 14.5% of the boys would be expected to have SBP greater than 120
mm Hg.
Refer to Handout 3.1: Table of Standard Normal Distribution and Figure 2 below.
Figure 2: Distribution Curve Showing the Probability That SBP is Greater Than 120 mm Hg
Source: Jones et al, 2008
Instruction
The tutor will provide an example of calculating SND1 and SND2 based on results from
Exercise 1.
Example
What percentage of boys would be expected to have systolic blood pressure less than 120 mm
Hg?
• If 14.5% have SBP greater than 120 mm Hg, then 100 – 14.5 = 85.5% will have SBP
less than 120 mm Hg.
• What proportion of boys would be expected to have SBP between 85 and 120 mm Hg?
• Calculate $SND_1 = \frac{85 - 105.8}{13.4} = -1.55$ and $SND_2 = \frac{120 - 105.8}{13.4} = 1.06$
Refer to Handout 3.1: Table of Standard Normal Distribution and Figure 3 below.
Figure 3: Distribution Curve Showing the Probability that SBP is Between 85 and 120 mm
Hg.
Note: This figure is displayed using the Table of Normal Distribution (z).
Source: Jones et al, 2008
Instruction
The tutor will provide an example of calculating the area between SND1 and SND2 based on
results from Exercise 2.
• The area to the right of SND2 = 1.06 is 0.14457 and the area to the left of SND1 = –1.55 is
0.060571, so the proportion with SBP between 85 mm Hg and 120 mm Hg is 100 – 14.5 –
6.1 = 79.4%.
• What will be the range of blood pressures for school boys at the 95% limits? That is,
within what limits would the central 95% of SBPs be expected?
o If µ = 105.8 and σ = 13.4 then, µ ± 1.96 σ includes 95% of SBP
o 105.8 – 1.96 x 13.4 to 105.8 + 1.96 x 13.4 i.e. 79.5 to 132.1 mm Hg
o i.e. 95% of the school boys have SBPs between 79.5 mm Hg and 132.1 mm Hg.
Refer to Handout 3.1: Table of Standard Normal Distribution
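The calculations in this example can also be checked with a short program. The following is a minimal sketch in Python, assuming the standard library class statistics.NormalDist is available (Python 3.8 or later). The names MU, SIGMA and snd are chosen here for illustration and do not come from the manual.

```python
from statistics import NormalDist

# Values taken from the worked example above.
MU = 105.8     # mean SBP in mm Hg
SIGMA = 13.4   # standard deviation in mm Hg


def snd(x, mu=MU, sigma=SIGMA):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mu) / sigma


z = NormalDist()  # standard normal distribution (mean 0, SD 1)

# Proportion with SBP greater than 120 mm Hg (area to the right of SND2).
p_above_120 = 1 - z.cdf(snd(120))

# Proportion with SBP less than 85 mm Hg (area to the left of SND1).
p_below_85 = z.cdf(snd(85))

# Proportion with SBP between 85 and 120 mm Hg.
p_between = 1 - p_above_120 - p_below_85

# Limits containing the central 95% of values: mean ± 1.96 SD.
lower = MU - 1.96 * SIGMA
upper = MU + 1.96 * SIGMA

print(f"SND1 = {snd(85):.2f}, SND2 = {snd(120):.2f}")
print(f"P(SBP > 120) = {p_above_120:.3f}")
print(f"P(85 < SBP < 120) = {p_between:.3f}")
print(f"Central 95%: {lower:.1f} to {upper:.1f} mm Hg")
```

Because the program does not round the SND to two decimal places before looking up the area, its results may differ slightly in the last digit from those obtained with the printed table.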
Instruction
You will work in small groups to calculate SND. You will need blank sheets of paper and
calculators for this activity. One group will present their responses and let other groups share
the discussion.
Refer to Worksheet 3.1: Calculating the SND and review instructions.
Key Points
This session emphasized the importance of:
• Normal distribution curve and its characteristics
• Probability distribution
• Continuous probability distribution
Evaluation
• What are the characteristics of a normal distribution curve?
• What are probability distribution and normal probability distribution?
• Give the formula for calculating standard deviation.
References
• Jones, D. et al. (2008). Biostatistics Work Book-Field Epidemiology and Laboratory
• Makwaya, et al. (1997). Lecture Notes in Biostatistics. Department of Epidemiology and
Biostatistics, MUCHS: Tanzania.
• Varkevisser, et. al. (1995). Designing and Conducting Health Systems Research Projects,
Handout 3.1: Table of Standard Normal Distribution
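This table shows the shaded areas: the combined area in the two tails of the standard
normal curve, to the left of –z and to the right of z.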
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 1.000 .992 .984 .976 .968 .960 .952 .944 .936 .928
0.1 .920 .912 .904 .897 .889 .881 .873 .865 .857 .849
0.2 .841 .834 .826 .818 .810 .803 .795 .787 .779 .772
0.3 .764 .757 .749 .741 .734 .726 .719 .711 .704 .697
0.4 .689 .682 .674 .667 .660 .653 .646 .638 .631 .624
0.5 .617 .610 .603 .596 .589 .582 .575 .569 .562 .555
0.6 .549 .542 .535 .529 .522 .516 .509 .503 .497 .490
0.7 .484 .478 .472 .465 .459 .453 .447 .441 .435 .430
0.8 .424 .418 .412 .407 .401 .395 .390 .384 .379 .373
0.9 .368 .363 .358 .352 .347 .342 .337 .332 .327 .322
1.0 .317 .312 .308 .303 .298 .294 .289 .285 .280 .276
1.1 .271 .267 .263 .258 .254 .250 .246 .242 .238 .234
1.2 .230 .226 .222 .219 .215 .211 .208 .204 .201 .197
1.3 .194 .190 .187 .184 .180 .177 .174 .171 .168 .165
1.4 .162 .159 .156 .153 .150 .147 .144 .142 .139 .136
1.5 .134 .131 .129 .126 .124 .121 .119 .116 .114 .112
1.6 .110 .107 .105 .103 .101 .099 .097 .095 .093 .091
1.7 .089 .087 .085 .084 .082 .080 .078 .077 .075 .073
1.8 .072 .070 .069 .067 .066 .064 .063 .061 .060 .059
1.9 .057 .056 .055 .054 .052 .051 .050 .049 .048 .047
2.0 .046 .044 .043 .042 .041 .040 .039 .038 .038 .037
2.1 .036 .035 .034 .033 .032 .032 .031 .030 .029 .029
2.2 .028 .027 .026 .026 .025 .024 .024 .023 .023 .022
2.3 .021 .021 .020 .020 .019 .019 .018 .018 .017 .017
2.4 .016 .016 .016 .015 .015 .014 .014 .014 .013 .013
2.5 .012 .012 .012 .011 .011 .011 .010 .010 .010 .010
2.6 .009 .009 .009 .009 .008 .008 .008 .008 .007 .007
2.7 .007 .007 .007 .006 .006 .006 .006 .006 .005 .005
2.8 .005 .005 .005 .005 .005 .004 .004 .004 .004 .004
2.9 .004 .004 .004 .003 .003 .003 .003 .003 .003 .003
3.0 .003
Source: Makwaya et. al, 1997
Worksheet 3.1: Calculating the SND
Instructions
Use Handout 3.1, plain papers, and calculators for this activity. Select a reporter to record and
report the work you do in small groups. Present your responses in plenary.
Question
Suppose the average length of stay in a chronic disease hospital of a certain type of patient is
60 days with a standard deviation of 15. If it is reasonable to assume an approximately
normal distribution of lengths of stay, find the probability that a randomly selected patient
from this group will have a length of stay:
a) Greater than 50 days
b) Less than 30 days
c) Between 30 and 50 days
d) Greater than 90 days
Answers:
Session 4: Sampling Techniques
Learning Objectives
• Describe the concept of sampling
• State the types of sampling methods
• Describe probability (random sampling) and non-probability sampling
• Calculate sample size for estimation of a mean and proportion
Concept of Sampling
Definition of Terms
• Sampling: The process of selecting any portion of a population as representative of that
population.
o In research we are often dealing with groups which are effectively infinite, such as the
number of children under-five in a district.
o In sampling, part of a group (population) is chosen to provide information which can
be generalized to the whole group, although in theory it would be possible to
investigate the whole group.
o Sampling is adopted to reduce labor and costs. If the whole population is studied, the
process is referred to as taking a census.
• Sampling Unit: An element or set of elements considered for selection in some stage of
sampling.
• Sampling Frame: A list of all the sampling units in the population.
• Sampling Scheme: A method of selecting sampling units from a sampling frame.
Types of Sampling Methods
Random vs. Non-Random Sampling

• Selection of the study units can be purposive or random.
• When it is purposive, no valid assessment of sampling error can be made and in many
instances this will lead to some bias.
• If conclusions that are valid for the whole population are to be drawn on the basis of a
sample, then the sample should be representative of that population.
• A representative sample is one that has all the important characteristics of the population
from which it is drawn.
• Selection of a sample on a random basis is a necessary, but not always sufficient,
condition to achieve representativeness.
• We shall consider two main aspects of sampling:
o Sampling Methods
o Sample Size
Refer to Handout 4.1: Sampling Techniques
• In this discussion, we shall confine ourselves to surveys designed to provide estimates of
certain characteristics of populations, particularly the mean and the proportion, as
opposed to other study types.
Classification of Sampling Methods

• The choice of a particular sampling method is influenced by the availability of a list of
all the units that constitute the study population.
• Sampling Frame: A list of all the units in the study population.
o Examples could be a list of villages, a list of eligible users of family planning
methods, a list of university students, etc.
• Sampling methods can be classified into two types:
o Non-probability sampling
o Probability sampling
Non-Probability Sampling
• There are two common non-probability sampling methods: convenience sampling and
quota sampling.
• Convenience Sampling: The sample is obtained on a convenience basis.
o Investigators select the study units that happen to be available at the time of data
collection. (Many hospital-based studies use convenience samples).
o A major limitation of this approach is that the sample drawn may be quite
unrepresentative of the study population.
• Quota Sampling: A fixed predetermined number of sample units from different
categories of the study population is obtained.
o A sample obtained in this manner ensures that a certain number of sample units from
different categories with specific characteristics (such as age, sex and religion) are
represented in the sample.
o It is useful when one desires to provide a balance of study units according to some
characteristics of interest.
o Convenience sampling would not achieve this sort of balance.
Probability Sampling
• Probability sampling is also frequently called random sampling.
• In probability sampling the selection procedure has some element of probability/chance.
In particular, a study unit has a known probability of being selected into the sample.
Types of Probability Sampling

• We shall discuss five forms of probability sampling (also known as random sampling),
and the advantages and disadvantages of each method.
o Simple random sampling
o Systematic Sampling
o Stratified Sampling
o Cluster Sampling
o Multistage Sampling
Simple Random Sampling

• This is the simplest form of random sampling and forms the model for all the basic results
of sampling theory.
• Units in the study population have an equal chance of being selected.
• The steps involved in simple random sampling include:
o Obtain a numbered list of all units in the study population (i.e. availability of
complete sampling frame).
o Decide on the size of the sample.
o Select the required number of units using either the ‘lottery’ system or tables of
random numbers.
Refer to Handout 4.2: Table of Random Sampling Numbers
• Advantages of simple random sampling:

o Simple
o Sampling error easily measured
• Disadvantages of simple random sampling:
o Need complete list of units
o Does not always achieve the best representativeness
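The steps of simple random sampling listed above can be sketched in Python. This is only an illustration: the function name simple_random_sample and the frame of 720 numbered units are assumptions made here, and the standard library's random.sample plays the role of the 'lottery' or the table of random numbers.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Select n units from the sampling frame, each with an equal chance.

    frame: numbered list of all units in the study population
    n:     the sample size decided on by the investigator
    seed:  optional value that makes the selection repeatable
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)  # selection without replacement


# Hypothetical sampling frame: units numbered 1 to 720.
frame = list(range(1, 721))
sample = simple_random_sample(frame, 80, seed=1)
print(sorted(sample)[:10])
```

Fixing the seed is a convenience for checking the work later; in a real survey the investigator would record which units were drawn.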
Systematic Sampling
• In systematic sampling, elements in the sample are obtained in a systematic way.
• The steps involved in systematic sampling include:
o Obtain the sampling frame and the size of the study population N.
o Decide on the sample size, n.
o Calculate the sampling interval, k = N/n.
o Select the first element at random from the first k units.
o Include every kth unit from the frame into the sample.
Refer to Handout 4.3: Systematic Random Sampling
• Suppose a sample of 80 individuals is to be selected from a population of 720 people.
Then N = 720 and n = 80, and the sampling interval is:
k = N/n = 720/80 = 9
• To determine the first unit in the sample (the 4th step in systematic sampling listed
above), select one individual randomly from the first 9 individuals on the list.
• If, using simple random sampling, the initial selection was 7, the selected individuals
would be those occupying positions 7, 7+9=16, 16+9=25, 25+9=34, etc.,
according to the 5th step listed above (i.e., include every kth unit). This continues until
80 individuals have been obtained.
• Advantages of systematic sampling:

o Ensures representation across list
o Easy to implement
• Disadvantage of systematic sampling:
o Dangerous if list has cycles
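The systematic sampling steps and the example with N = 720 and n = 80 can be sketched as follows. The function name systematic_sample and its parameters are illustrative assumptions, and the sketch assumes that N is an exact multiple of n, as in the example.

```python
import random


def systematic_sample(N, n, start=None, seed=None):
    """Return the list positions (counting from 1) chosen by systematic sampling.

    Assumes N is an exact multiple of n, as in the example above.
    """
    k = N // n  # sampling interval
    if start is None:
        # Random start among the first k units on the list.
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(n)]


# Reproduce the example: interval 9, initial selection 7.
positions = systematic_sample(720, 80, start=7)
print(positions[:4], "...", positions[-1])
```

Leaving out the start argument lets the program pick the first unit at random, which is what the 4th step requires in practice.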
Stratified Sampling
• In stratified sampling the population is divided into subgroups (or strata) whereby each
stratum is sampled randomly with a known sample size.
• Strata may be defined according to some characteristics of importance in the survey.
o Examples of strata include occupation, religion, age groups or even locality
(whereby regions of the country may be taken as strata in a national health survey).
• The steps involved in stratified sampling are as follows:
o Divide the population into subgroups (strata).
o Draw a sample (of predetermined size) randomly from each stratum.
• An important stratification principle is that the between-strata variability should be as
high as possible, or equivalently that each stratum should be as homogeneous as possible.
That is, units within a stratum should be as alike as possible and units in different
strata should be as different as possible.
• Advantages of stratified sampling:

o Increased precision
o All subgroups represented, allowing separate conclusions about each of them
o Representation of minority is possible
• Disadvantages of stratified sampling:
o Sampling error difficult to measure
o Loss of precision if very small numbers sampled in individual strata
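The two steps of stratified sampling can be sketched in Python. The strata (age groups), the unit identifiers and the sample sizes below are hypothetical values chosen only to show the mechanics. The function name stratified_sample is likewise an assumption.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of the given size from each stratum.

    strata: dict mapping stratum name -> list of units in that stratum
    sizes:  dict mapping stratum name -> number of units to draw
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}


# Hypothetical sampling frame: unit identifiers grouped by age group.
strata = {
    "under 5": [f"U{i}" for i in range(1, 201)],
    "5 to 14": [f"C{i}" for i in range(1, 301)],
    "15 and over": [f"A{i}" for i in range(1, 501)],
}
sizes = {"under 5": 20, "5 to 14": 30, "15 and over": 50}

sample = stratified_sample(strata, sizes, seed=1)
for name, units in sample.items():
    print(name, len(units))
```

Here the sample sizes are proportional to the size of each stratum (10% of each), which is one common choice. Other allocations are possible.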
Cluster Sampling
• There are situations in which obtaining a complete list of individuals in the study
population is not feasible or practical, or a complete sampling frame is not available
before the investigation starts.
• In such cases it would be easy and convenient to consider a sampling frame in which the
sampling units are a collection (cluster) of study units.
o Examples of clusters include schools, hospital wards, villages, etc.
• Because the sampling unit is a cluster (e.g. a school) the sampling method is known as
cluster sampling.
• The selection steps are the same as those for any of the above random
sampling methods, except that the sampling unit is the cluster.
o Divide the population into clusters.
o Draw a sample (of predetermined size) of clusters at random, and study the units
within the selected clusters.
• Unlike in stratified sampling, an important principle in cluster sampling is that units
within a cluster should be as heterogeneous as possible while the between-cluster variability
should be as low as possible.
• Advantages of cluster sampling:
o Simple, as no list of units required
o Fewer resources required
• Disadvantages of cluster sampling:
o Imprecise if the design effect is large
Multistage Sampling
• Multistage sampling is carried out in two or more stages, and different sampling
techniques can be employed at each stage.
• In this method the sampling frame is divided into a population of first-stage sampling
units, of which a first-stage sample is taken.
• Each first-stage unit selected is subdivided into second-stage sampling units, which are
then sampled.
• The process continues until it is convenient to stop.
• To illustrate multistage sampling consider a health survey of primary school children in
Tanzania mainland.
o An immediate problem to taking a sample of these children is that it is almost
impossible to construct a complete sampling frame.
o A multistage sample might be:
Take a sample of regions.
Within each selected region take a sample of districts.
Within each selected district, take a sample of schools.
Within each selected school, take a sample of school children, and carry out the
investigation.
o The sample would thus be accomplished in four stages. Notice that the construction of
a complete sampling frame for each stage is relatively easy.
o In addition to the advantage of easily identifying complete sampling frames, a
multistage sampling procedure is likely to result in an appreciable cost savings by
concentrating resources at selected schools instead of a sample made up of children
scattered in all parts of the country.
• Sometimes, in the final stage of sampling, complete enumeration of the available units is
undertaken.
• In the above example, once a survey team has reached the level of a school it may cost
little extra to examine all the children in the school, and it may be worthwhile to
expend that cost and effort for several reasons (such as avoiding feelings of exclusion
among children in the same school who are not included in the study).
• Advantages of multistage sampling:
o No complete listing of population required
o Construction of a complete sampling frame for each stage relatively easy
o Most feasible approach for large populations (cost saving)
• Disadvantages of multistage sampling:
o Several sampling lists are needed (one for each stage)
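The four-stage example above (regions, districts, schools, children) can be sketched in Python. The sampling frame below is entirely made up for illustration, with 6 regions, 4 districts per region, 5 schools per district and 30 children per school. The function name multistage_sample and the number selected at each stage are also assumptions.

```python
import random


def multistage_sample(frame, sizes, rng):
    """Select units stage by stage using simple random sampling at each stage.

    frame: nested dicts (one level per stage), with a list of study units
           at the innermost level
    sizes: number of units to select at each stage, outermost stage first
    """
    n, rest = sizes[0], sizes[1:]
    if isinstance(frame, dict):
        chosen = rng.sample(sorted(frame), n)  # sample the units of this stage
        return {key: multistage_sample(frame[key], rest, rng) for key in chosen}
    return rng.sample(frame, n)  # final stage: sample the study units


# Hypothetical frame: regions -> districts -> schools -> children.
frame = {
    f"Region {r}": {
        f"District {r}.{d}": {
            f"School {r}.{d}.{s}": [f"Child {r}.{d}.{s}.{c}" for c in range(1, 31)]
            for s in range(1, 6)
        }
        for d in range(1, 5)
    }
    for r in range(1, 7)
}

# 2 regions, 2 districts per region, 2 schools per district, 10 children per school.
sample = multistage_sample(frame, [2, 2, 2, 10], random.Random(0))
total = sum(
    len(children)
    for districts in sample.values()
    for schools in districts.values()
    for children in schools.values()
)
print(total, "children selected")
```

Note that a list is needed only for the units selected at the previous stage, which is why a complete frame for each stage is relatively easy to construct.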
Calculating Sample Size for Estimation of a Mean and Proportion
Sample Size for Estimation of Mean

• In any random sampling technique, to decide on the sample size for a mean the following
formula is used:
n = 4σ² / ε²
• The formula is obtained as follows:
o The investigator will have to determine the maximum likely error ε (i.e. the largest
acceptable difference between the population mean µ and the sample mean x̄).
o From sampling distribution theory, we know that the interval µ ± 2σ/√n will
include x̄ 95% of the time, where σ is the population standard deviation and n is the
sample size. (Note that the critical value 1.96 has been approximated to 2.) That is,
the maximum likely error is 2σ/√n.
o Thus ε = 2σ/√n. Hence ε² = 4σ²/n.
o Therefore, the required sample size, n, is given by n = 4σ²/ε².
o This formula requires knowledge of the population standard deviation σ,
and in almost all surveys it is unknown.
o Thus, it is necessary to replace σ with an estimate. This estimate may be obtained
from results of previous studies on the variable or alternatively be obtained as direct
result from a pilot study.
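As an illustration of the formula n = 4σ²/ε², the sketch below computes a sample size in Python. The function name sample_size_mean is an assumption, and the input values are hypothetical: σ = 13.4 mm Hg is borrowed from the blood pressure example in Session 3 as if it came from a previous study, and ε = 2 mm Hg is an arbitrary choice of maximum likely error.

```python
import math


def sample_size_mean(sigma, epsilon):
    """Sample size for estimating a mean: n = 4 * sigma^2 / epsilon^2.

    sigma:   estimate of the population standard deviation
    epsilon: maximum likely error, in the same units as sigma
    """
    n = 4 * sigma ** 2 / epsilon ** 2
    # Round up to a whole number of units; the small tolerance guards
    # against floating-point noise when n is already a whole number.
    return math.ceil(n - 1e-9)


# Hypothetical inputs: sigma from a previous study, epsilon chosen by the investigator.
print(sample_size_mean(sigma=13.4, epsilon=2))
```

Halving ε multiplies the required sample size by four, which shows how quickly the cost of extra precision grows.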
Sample Size for Estimation of a Proportion

• In any random sampling technique, the following formula is used to decide the sample
size for a proportion:
n = 4Π(1 – Π) / ε² or n = 4Π(100 – Π) / ε²
(the first form is used when Π and ε are expressed as proportions, the second when they
are expressed as percentages).
• The reasoning is similar to that for the mean, with the standard error of a proportion,
√(Π(1 – Π)/n), in place of σ/√n. Π is the anticipated population proportion (e.g. the
prevalence in a prevalence study), usually taken from previous studies.
• If Π (prevalence) is expressed as a percentage, ε (margin of error) must also be
expressed as a percentage.
• Where no source for the population prevalence exists, use Π = 0.5 (50%), because for a
fixed ε (margin of error) this value maximizes n.
• The factor 4 corresponds to 95% confidence limits (1.96 approximated to 2). The value
of ε is commonly taken to be 0.05 (5%).
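The sample size formula for a proportion can be sketched in Python in both forms. The function names are assumptions made here, and the inputs (Π = 0.5 with ε = 0.05, i.e. the case where no prior prevalence is available) are used only as an illustration.

```python
import math


def sample_size_proportion(pi, epsilon):
    """n = 4 * pi * (1 - pi) / epsilon^2, with pi and epsilon as proportions."""
    n = 4 * pi * (1 - pi) / epsilon ** 2
    # Round up; the small tolerance guards against floating-point noise.
    return math.ceil(n - 1e-9)


def sample_size_percentage(pct, epsilon_pct):
    """n = 4 * pct * (100 - pct) / epsilon_pct^2, with both values in percent."""
    n = 4 * pct * (100 - pct) / epsilon_pct ** 2
    return math.ceil(n - 1e-9)


# No prior information on prevalence: use 0.5 (50%) with a 5% margin of error.
print(sample_size_proportion(0.5, 0.05))
print(sample_size_percentage(50, 5))

# Any other prevalence with the same margin of error needs a smaller sample.
print(sample_size_proportion(0.2, 0.05))
```

The two functions give the same answer as long as Π and ε are expressed in the same form, which is the point of the rule about percentages above.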
Instructions
Refer to Worksheet 4.1: Calculating Sample Size for Estimation of a
Proportion and review the instructions. Read the scenario on the worksheet, and calculate
the sample size based on the scenario. You will work for about 10 minutes, and then a few
groups will briefly share their responses with the class.
Other Aspects of Sampling
Bias in Sampling
• Bias in sampling refers to the systematic error in sampling procedures that may lead to
distortion in results. Sources of bias in sampling include the following:
o Non-response: This is encountered mainly when subjects refuse to participate. They
may refuse an interview, or forget to fill out a questionnaire. The non-respondents
(particularly those due to refusal) may differ systematically from the respondents.
o Studying volunteers only: The fact that some people volunteer to participate in a
study may mean that they differ from the general population on the factors being
studied.
o Sampling registered patients only: Patients going to a hospital are likely to differ
from those being treated elsewhere.
o Missing cases of short duration: In prevalence studies, cases of short duration (e.g.
fatal cases, cases with short episodes, and mild cases) are more likely to be missed.
o Seasonal bias: If the condition under study exhibits different characteristics in
different seasons of the year, this may lead to a distortion in the results, depending on
the period of data collection.
o Tarmac bias: Selecting a study area on the basis of accessibility will generally
constitute a selection bias.
Ethical Considerations
• If recommendations from a study are intended for the entire study population (e.g. all
relevant individuals in a region) then one is ethically bound to ensure the sample studied
is representative of that population.
• Remember that random selection of a sample does not guarantee representativeness.
Key Points
• It is cheaper and more realistic to study a portion of population than to study the whole
population.
• Sampling methods can be grouped into two types: random sampling and non-random
sampling. With random sampling, every subject has a known (often equal) chance of being
selected. Therefore, there is less likelihood of bias.
• The ideal number of subjects/study units to be included in the study can best be obtained
through sample size calculation using a mean or proportion.
• If conclusions that are valid for the whole population are to be drawn on the basis of a
sample, then the sample should be representative.
Evaluation
• What is sampling?
• What are the types of sampling methods?
• What is random sampling and non-random sampling?
• How can you calculate sample size for estimation of a mean and estimation of a
proportion?
References
• Bonita, R. et al. (2006). Basic Epidemiology, 2nd Edition. Geneva, Switzerland: WHO.
• Jones, D. et al. (2008). Biostatistics Work Book. Field Epidemiology and Laboratory
Training Programmes. (FELTP).
• Makwaya, et al. (1997) Lecture Notes in Biostatistics. Department of Epidemiology and
Biostatistics, MUCHS. Tanzania.
• McCusker, J. (2001). Epidemiology in Community Health. (Revised Edition) Rural
Health Series No 9. Nairobi, Kenya: AMREF.
• Rosner, B. (2006). Fundamentals of Biostatistics (6th edition). Belmont, California:
Thomson Brooks/Cole.
• Varkevisser, C., et al. (1995). Designing and Conducting Health Systems Research
Projects, Vol. 2, Part 2, Module 24. Health Systems Research Training Series.
Handout 4.1: Sampling Techniques
Source: Makwaya et al., 1997.
The diagram above depicts drawing a sample of size n using a particular sampling method from
a study population with N units (subjects). Inferential statistics techniques are then used to
make inferences about the study population on the basis of results from the sample.
Steps:
1. Identify the study population. (Note that it is possible to have several study populations
in one study.)
2. Draw a sample from the study population.
3. Describe the sample by calculating relevant statistics.
4. Make inferences about the parameters.
5. Draw conclusions about the study population.
Handout 4.2: Table of Random Sampling Numbers
Random Sampling Numbers
Source: WHO, 2001
Handout 4.3: Systematic Random Sampling
This is an example of systematic sampling, as shown above. The houses that have an arrow
pointing at them are the ones selected.
Source: Rosner, B., 2006
Worksheet 4.1: Calculating Sample Size for Estimation of a
Proportion
Instructions:
• Please work in small groups.
• Read the following scenario and answer the questions that follow.
• Be prepared to briefly share your response with the class.
Scenario:
You have been assigned to conduct a study in order to estimate the prevalence (i.e.
proportion) of people affected with Bancroftian filarial infection in the Dar es Salaam region.
A review of literature on the subject reveals that studies done along the East African coastal
strip some years back showed the prevalence to be in the order of 30%.
Question:
• What sample size do you require in order to come up with a reasonable estimate in your
study?
• Give a complete answer. Describe any assumptions or prior decisions that you undertake.
Answer:
Session 5: Estimation of Mean and Proportion
Learning Objectives
• Define sampling errors and non-sampling errors
• Describe the standard error of the mean
• Describe the standard error of the proportion
• Estimate the standard error of the mean
• Estimate the standard error of the proportion
Introduction to Sampling Errors

• We study a sample in order to learn something about the population as a whole. In
general, we wish to estimate characteristics of the population such as:
o The mean value of some measurement
o The proportion of the population with some characteristic
Figure 1: Sample Statistics and Population Parameter Notation
Quantity      Sample (statistic)      Population (parameter)
Mean          x̄ (‘x-bar’)             µ (‘mu’)
Variance      s²                      σ² (‘sigma squared’)
Proportion    p′ or p̂ (‘p-hat’)       P
• In general, the sample mean or sample proportion is unlikely to be exactly equal to the
mean or proportion in the population, although the former is intended to estimate the
latter. If the two are exactly equal to one another, it is just by coincidence.
• Our conclusion about a population on the basis of the sample we have taken will almost
always have some error.
• We distinguish between two sorts of error:
o Sampling errors
o Non-sampling errors
Sampling Errors
• Sampling errors are those which arise due to the fact that we have observed only part of
the whole population, and they get less important as the sample size increases.
o For example, an estimate of the mean number of children per household in a certain
district based on two households only (in the district) will certainly be poorer than an
estimate based on a sample of 100 households.
o We say there is less sampling error in the latter situation than in the former. If we
investigated the whole population (i.e. all households in the district) the sampling
error would be zero because we would know the population mean exactly.
Non-Sampling Errors
• Non-sampling errors are due mainly to faults in the sampling process, which are likely to
create room for potential sources of bias. (These are sometimes also referred to as
systematic errors.)
o These errors are potentially serious since the bias they cause may lead to invalid
conclusions.
o Increasing the size of a sample will not necessarily reduce the non-sampling errors.
o Subjects who refuse to participate in an interview or forget to fill in a
questionnaire may differ systematically from those who diligently respond.
o Non-sampling errors also occur through equipment faults, observer errors and during
data processing through coding, data entry, etc.
o However, in this section we will direct our attention to sampling errors (also known as
random errors).
Standard Error of the Arithmetic Mean

• Consider the variable X. Suppose we take a sample of ‘n’ units and measure this variable.
The sample mean x̄ (given by Σx/n) may be different from the population value simply
because we have taken a sample. The question is how we measure this sampling
variation.
• Ideally, we could take several samples of size ‘n’ and calculate x̄ for each sample.
• It is unlikely that the values of x̄ will be the same, but if they were all similar (i.e. at
least close) this would imply that the sampling error is small.
• If, in contrast, the values differed markedly, we would reasonably conclude that the
sampling error is large.
• Let us revisit the issue of the sampling error in the situation of a sample taken once.
• Two properties of the sampling error are apparent:
o The larger the sample size, the better the precision of the estimate (i.e. large samples
are more likely to produce closer estimates than small samples).
o If the variability of the observations in the parent (study) population is small we
would expect the error to be small, and vice versa.
o Thus, the sampling error depends on the variability of observations in the population.
• Take a moment to recall the idea of repeatedly taking a random sample of size ‘n’ and
calculating the sample mean x̄ each time. This would lead to a series of
values of x̄, and the natural questions relating to this (new) variable x̄ will be about its
distribution as well as its mean and variance.
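The idea of repeated sampling can be tried out with a small simulation. The sketch below is only an illustration: the population of 10,000 weights (mean about 16 kg, standard deviation about 2 kg) is generated artificially, and the function name sample_means, the sample sizes and the number of repeats are all assumptions made here.

```python
import random
import statistics


def sample_means(population, n, repeats, seed=0):
    """Take `repeats` random samples of size n and return the mean of each."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]


# Artificial study population: 10,000 weights with mean about 16 kg and SD about 2 kg.
rng = random.Random(42)
population = [rng.gauss(16, 2) for _ in range(10_000)]

# Repeat the sampling 2,000 times for a small and a larger sample size.
means_small = sample_means(population, n=11, repeats=2000)
means_large = sample_means(population, n=100, repeats=2000)

# The spread of the sample means measures the sampling error.
spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)

print(f"n = 11:  spread of sample means = {spread_small:.2f} kg")
print(f"n = 100: spread of sample means = {spread_large:.2f} kg")
```

The sample means from the larger samples cluster more tightly around the population mean, which is the first property of the sampling error noted above.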
How to Calculate the Standard Error and 95% Confidence Interval of a Mean
• When dealing with numerical data you may wish to estimate to what degree the sample
mean varies from the population mean.
• The standard error for the mean is calculated by dividing the standard deviation by the
square root of the sample size:
SE = standard deviation / √(sample size), or SE = SD / √n
• It can be assumed, for a normally distributed variable, that approximately 95% of all
possible sample means lie within two standard errors of the population mean. In other
words, we can be 95% sure that the population mean, of which we want to have the best
possible estimate, lies within two standard errors of our sample mean.
• When describing variables statistically you usually present the calculated sample mean
plus or minus two standard errors.
• This is called the 95% Confidence Interval. It means that you are about 95% certain that
the true population mean is within this interval.
Instructions
Tutor will show example of calculating standard error and confidence interval, as follows:
The weights of a random sample of 11 three-year-old children were taken in a village. The
sample mean was 16 kg and the standard deviation of the sample was 2 kg.
Standard Error:
SE = 2 / √11 = 0.6 kg
95% Confidence Interval:
16 ± (2 × 0.6) = 14.8 to 17.2 kg
This means that we are approximately 95% certain that the mean weight of all three-year-old
children in this population lies between 14.8 and 17.2 kg.
Instructions
Tutor will show another example of calculating standard error and confidence interval.
Follow along with the calculations.
Now, imagine that the size of the random sample in Exercise 1 is increased. The weights of a
random sample of 20 three-year-old children were taken in a village. The sample mean was
16 kg and the standard deviation of the sample was 2 kg.
Standard Error:
SE = 2 / √20 = 0.45 kg
95% Confidence Interval:
16 ± (2 × 0.45) = 15.1 to 16.9 kg
This means that we are approximately 95% certain that the mean weight of all three-year-old
children in this population lies between 15.1 and 16.9 kg.
Note that the increase in sample size in the second example clearly improved the precision
of the estimate, because the confidence interval was narrower.
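The two examples above can be reproduced with a few lines of Python. This is a minimal sketch; the function names standard_error_mean and ci95_mean are chosen here for illustration, and the interval uses the same approximation (two standard errors) as the text.

```python
import math


def standard_error_mean(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


def ci95_mean(mean, sd, n):
    """Approximate 95% confidence interval: mean plus or minus two standard errors."""
    se = standard_error_mean(sd, n)
    return mean - 2 * se, mean + 2 * se


# Sample mean 16 kg and standard deviation 2 kg, for samples of 11 and 20 children.
for n in (11, 20):
    se = standard_error_mean(2, n)
    low, high = ci95_mean(16, 2, n)
    print(f"n = {n}: SE = {se:.2f} kg, 95% CI = {low:.1f} to {high:.1f} kg")
```

The program keeps the unrounded standard error when calculating the interval, so intermediate values may differ slightly from hand calculations that round the standard error first.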
Standard Error of the Proportion
How to Calculate the Standard Error and 95% Confidence Interval of a Percentage
• In the previous section we calculated the standard error and the 95% confidence interval
of a sample mean, starting with numerical data. We will now do the same for a
percentage that was calculated from categorical data.
• The formula for calculating the standard error of a percentage is:
SE = √( p(100 − p) / n )
In this equation, p represents one of the percentages and (100 - p) represents the other
percentage. The standard error of the percentage is obtained by multiplying the
percentages, dividing the result by the number in the sample, and taking the square root.
• Note that instead of percentages we can use proportions. A proportion can take on any
value between 0 and 1.
• The formula for calculating the standard error of a proportion would be:
SE = √( p(1 − p) / n )
where p is the proportion of the sample with the characteristic and (1 − p) is the
proportion without it.
Instructions
Tutor will provide an example of calculating standard error and confidence interval using
categorical data as follows:
Among a sample of 120 TB patients, which was drawn from the total population of TB
patients in the country, it was found that 28 (or 23.3%) did not comply with their out-patient
treatment. The other 92 (or 76.7%) exhibited a satisfactory degree of compliance.
We now want to calculate the standard error of the percentage of non-compliance (23.3%).
This is done as follows:
If p represents one of the percentages (23.3%) and (100 - p) represents the other (76.7%),
then the standard error of the percentage is obtained by multiplying them, dividing the result
by the number in the sample and taking the square root.
Formula for the Standard Error of a Percentage:
SE = √( p(100 − p) / n )
Standard Error:
SE = √( 23.3 × 76.7 / 120 ) = 3.9
Now, we calculate the confidence interval for the percentage of non-compliance in the country.

23.3% ± (2 × 3.9) = 15.5% to 31.1%
This means that we are 95% confident that in the population of all TB patients in the country
from which the sample of 120 was drawn, 15.5% to 31.1% do not comply with their out-patient
treatment.
If we are to calculate this example as a proportion, we would do it as follows:
Standard Error:
SE = √( p(1 − p) / n )
SE = √( 0.233 × 0.767 / 120 ) = 0.039
95% Confidence Interval:
0.233 ± (2 × 0.039) = 0.155 to 0.311
Note that with this example, we can extrapolate from our proportions to come to the same
conclusion: We are 95% confident that in the population of all TB patients in the country
from which the sample of 120 was drawn, 15.5% to 31.1% do not comply with their out-
patient treatment.
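The TB compliance example can be checked with a short Python sketch. The function names standard_error_proportion and ci95_proportion are assumptions made here, and the proportion 0.233 is the rounded value used in the text.

```python
import math


def standard_error_proportion(p, n):
    """Standard error of a proportion: square root of p(1 - p) / n."""
    return math.sqrt(p * (1 - p) / n)


def ci95_proportion(p, n):
    """Approximate 95% confidence interval: p plus or minus two standard errors."""
    se = standard_error_proportion(p, n)
    return p - 2 * se, p + 2 * se


# 28 of 120 TB patients (a proportion of 0.233) did not comply with treatment.
p, n = 0.233, 120
se = standard_error_proportion(p, n)
low, high = ci95_proportion(p, n)
print(f"SE = {se:.3f}")
print(f"95% CI = {low:.3f} to {high:.3f}")
```

Because the program does not round the standard error before doubling it, the limits may differ in the third decimal place from the hand calculation above. The conclusion is the same.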
Activity: Small Group Work
Instructions
Refer to Worksheet 5.1: Calculating Standard Error of a Mean. Read the
worksheet instructions.
Calculate the Standard Error of the mean and 95% confidence interval. Record your
answers.
Question:
The haemoglobin levels of a random sample of 40 five-year-old children were taken in a
village. The sample mean was 13gm/dl and the standard deviation of the sample was 3 gm/dl.
Please work together to calculate the Standard Error of the mean, and to calculate the 95%
confidence interval.
Key Points
• The larger the sample size, the smaller the standard error and the narrower the confidence
interval will be.
• The advantage of having a sufficiently large sample is that the sample mean will be a better
estimate of the population mean.
• At a certain point, increases in sample size demand vast investments in time and money,
whereas the confidence interval only marginally decreases.
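The diminishing return described in the key points can be illustrated numerically. The sketch below assumes the usual formula for the standard error of a mean, SE = s / √n, and uses a made-up standard deviation of 10 units; neither the value nor the function name comes from the manual.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of a sample mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 10.0  # hypothetical sample standard deviation
for n in (25, 100, 400, 1600):
    se = standard_error_of_mean(sd, n)
    # Approximate half-width of a 95% confidence interval
    half_width = 1.96 * se
    print(f"n = {n:5d}  SE = {se:.2f}  95% CI half-width = {half_width:.2f}")
```

Each fourfold increase in sample size only halves the standard error, which is why very large samples buy little extra precision.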
Evaluation
• What is standard error?
• Why is it important to know standard error?
References
• Makwaya et al. (1997). Lecture notes in biostatistics. Department of Epidemiology and
Biostatistics. MUHAS.
• Varkevisser et al. (1995). Designing and conducting health systems research projects
(Volume 2, Part 2, Module 24). Health Systems Research Training Series.
Worksheet 5.1: Calculating Standard Error of a Mean
Instructions
Work in small groups to answer the following question.
Question:
The haemoglobin levels of a random sample of 40 children of five years old were taken in a
village. The sample mean was 13 gm/dl and the standard deviation of the sample was 3 gm/dl.
Please calculate the Standard Error of the mean and the 95% confidence interval.
Each group should record their answer on a flipchart (or on a section of chalkboard) in the
classroom.
Answer:
Session 6: Significance Tests of One Sample
Learning Objectives
• Define the terms statistical hypothesis, null hypothesis (H0), alternative
hypothesis (H1), test statistic, significance level, and critical value
• Describe p-value and its interpretation
• Differentiate statistical significance from practical significance
• Describe the Student’s t-test
Introduction to Significance Test

• In previous sessions we dealt with the estimation of population parameters by sample
statistics. These sample statistics may further be utilized to answer questions about the
population parameters.
• In the framework of statistical inference the question is reduced to a hypothesis and the
answer is expressed as the result of a test of the hypothesis.
Definition of Terms
• Statistical hypothesis: A statement about the parameter(s) or distribution of the
population(s) being sampled.
• Null hypothesis (H0): A term describing the particular hypothesis being tested.
o In many instances it is formulated for the sole purpose of being rejected or nullified. It
is often a hypothesis of ‘no difference’.
• Alternative hypothesis (H1): A statistical hypothesis that disagrees with the null
hypothesis.
o The null hypothesis H0 and the alternative hypothesis H1 concern populations, but
our conclusions are based on samples taken from these populations.
o Generalizing from a sample to the population can be dangerous due to sampling
errors. Therefore, we are unable to say that H0 or H1 is definitely true.
o If sampling errors are taken into account, we can investigate the likelihood that the
null or alternative hypotheses are true. We have to measure the relevant information
in the sampled data and weigh this information in relation to the sampling errors
involved.
• Test statistic: A statistic which represents the relevant sample information for the
question under investigation.
o The test statistic provides a basis for testing a statistical hypothesis and has a known
sampling distribution with tabulated percentage points (e.g. the standard normal
distribution). The value of a test statistic differs from one sample to another.
• Significance level: An arbitrary cut-off point which gives some small probability for
deciding when to declare the null hypothesis untenable or false.
• Critical value: A cut-off value corresponding to a given significance level, as determined
from the sampling distribution of the test statistic (by using statistical tables).
o The critical value is the boundary value such that if the value of the test statistic is
more extreme (i.e. more unlikely) than the critical value, then H0 is rejected and the
probability of rejecting H0 when it is true is less than the significance level.
The p-Value
• The p-value is the probability that a difference as large as (or larger than) the one we
have observed could have occurred simply by chance if the null hypothesis were true.
Interpretation of the p-Value

• Large p-values point to the null hypothesis (H0). Small p-values are evidence for the
alternative hypothesis (H1). A proposed guideline is:
Figure 1: p-value Interpretation

P-Value             Interpretation
p ≥ 0.05            No evidence against H0
0.01 ≤ p < 0.05     Evidence in favour of H1, but be careful
0.001 ≤ p < 0.01    Substantial evidence in favour of H1. The possibility that H0 is true can be neglected.
p < 0.001           Very strong evidence in favour of H1. The possibility that H0 is true can be neglected.
• Sample size is important in the interpretation of p-values.

o For a proper interpretation of the p-value the sample size should be considered. If the
sample size is too small the sampling error will be large.
o This will prohibit us from finding evidence against H0 and result in high p-values,
even if H0 is not true.
Figure 2: The Relationship between the p-Value and the Sample Size

p-Value   Small sample size                        Large sample size
Small     Evidence against H0                      Evidence against H0
          Results point away from H0               Results support H1
Large     Difficult to interpret                   No evidence against H0
          Can’t distinguish between H0 and H1      Results point at H0
• The following results on malnutrition among under-fives in Dodoma and Mwanza, obtained
with different sample sizes, confirm the above explanation. The same observed difference
(40% versus 30%) is not significant with a sample of 50 but highly significant with a
sample of 500. With a sample of 50,000, even a difference of one percentage point
(31% versus 30%) is highly significant.
Figure 3: Malnutrition in under-fives in Dodoma and Mwanza: p-Values and Sample Size

Sample size (n)   Dodoma   Mwanza   p-Value   Conclusion
50                40%      30%      0.29      No significant difference
500               40%      30%      0.0009    Highly significant
50,000            31%      30%      0.0006    Highly significant
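The manual does not say which test produced the p-values in Figure 3. As a hedged illustration, the sketch below assumes a two-sided z-test for the difference between two proportions, with a pooled standard error and the stated sample size in each group. Under that assumption (which is ours, not the manual's) the tabulated values are reproduced to the precision shown.

```python
import math

def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value for the difference between two proportions.

    Assumes equal group sizes and a pooled standard error.
    """
    pooled = (p1 + p2) / 2  # valid because both groups have the same size
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))

for n, dodoma, mwanza in [(50, 0.40, 0.30), (500, 0.40, 0.30), (50000, 0.31, 0.30)]:
    p_value = two_proportion_p_value(dodoma, mwanza, n)
    print(f"n = {n:6d}  {dodoma:.0%} vs {mwanza:.0%}  p = {p_value:.4f}")
```

The last row shows the point made in the text: with a very large sample, a difference of one percentage point becomes statistically significant even though it may have little practical importance.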
Statistical Significance and Practical Significance

• There are many situations in which a result may reveal a statistically significant
difference which might be quite unimportant clinically.
• For example, in a study to compare blood pressure in the left and right arms, a small
difference of about 1 mm Hg was found. This difference was highly statistically
significant but of no importance clinically.
• Similarly, it is not reasonable to take a non-significant result as indicating no effect, just
because we cannot rule out the null hypothesis.
Significance Tests from Confidence Intervals

• Recall that a confidence interval for a population parameter (µ or P) provides limits that
have a high probability of including the unknown parameter (i.e., 95% confidence
interval).
• When we perform a significance test of a particular hypothesis, we are essentially asking
whether the population parameter takes a particular value.
• These two approaches are related.
o If the 95% confidence interval includes the value of the parameter proposed by the
hypothesis then the result of the test must be non-significant at the 5% level (i.e. p >
0.05).
o If the 95% confidence interval does not include the value of the parameter specified in
the null hypothesis, then the result of the test must be significant at the 5% level (i.e. p
< 0.05).
Instructions
Tutor will use the example below to illustrate the relationship between confidence intervals
and significance tests.
Example
Consider a study in which patients are randomized either to a group receiving a new
treatment for a medical condition (i.e., the treatment group) or to a control group. The
mean survival time of the patients treated with the new technique is 46.9 months, and the
standard error of this mean is 4.33 months. (That is, x̄ = 46.9 months, and
SE(x̄) = 4.33 months.)
A 95% confidence interval for the true mean survival time due to this new technique can be
calculated as shown below:
95% CI = 46.9 ± 1.96 × SE(x̄) = 46.9 ± 1.96 × 4.33 = 46.9 ± 8.49

Therefore, the 95% CI runs from 46.9 − 8.49 to 46.9 + 8.49, that is, from 38.41 to 55.39
months.
The value proposed in the null hypothesis is 38.3 months. Note that it is not included in the
confidence interval.
Therefore, it can be concluded that 38.3 months is an unlikely value for the mean survival
time of patients after treatment with the new technique. This means that the null
hypothesis is rejected at the 5% level (i.e. p < 0.05).
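The link between the confidence interval and the significance test in this example can be sketched in a few lines. The function name `ci_and_test` is hypothetical, and the p-value uses the normal approximation, which the manual implies through its use of 1.96 but does not compute.

```python
import math

def ci_and_test(mean, se, null_value, z_crit=1.96):
    """Return the 95% CI limits, the z statistic and the two-sided p-value."""
    lower = mean - z_crit * se
    upper = mean + z_crit * se
    z = (mean - null_value) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return lower, upper, z, p_value

lower, upper, z, p_value = ci_and_test(mean=46.9, se=4.33, null_value=38.3)
inside = lower <= 38.3 <= upper
print(f"95% CI = {lower:.2f} to {upper:.2f}")
print(f"z = {z:.2f}, p = {p_value:.3f}, null value inside CI: {inside}")
```

The null value lies just outside the interval and the p-value is just below 0.05, so the two approaches agree.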
Student’s t-Test
• As already shown above, the standard normal deviate (SND) test involves the calculation
of the variance.
• The SND is then compared with the critical values 1.96 or 2.58.
• This was applied since the population standard deviation (σ) was known.
• If σ is unknown, the SND cannot be calculated.
• However, the value of σ can be estimated from the sample by the standard deviation s.
• Replacing σ in the formula below with s, we obtain a new quantity t, which follows
the t-distribution with (n − 1) degrees of freedom (df).

$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$$

The sample estimate s² is calculated with (n − 1) in the denominator:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

The test statistic for a hypothesized mean µ0 is then:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
• As the sample size increases, s should be nearly equal to σ and t will be very close to the
standard normal deviate.
Refer to Handout 6.1: Student’s t-Test Table
Instructions
Tutor will provide example of how to calculate standard error and confidence interval.
Example
The following data are uterine weights (in mg) of each of 20 rats drawn at random from a
large stock. Is it likely that the mean weight for the whole stock could be 24 mg, a value
observed in some previous work?
9 18 21 26 14 18 22 27 16 30
15 19 22 29 15 19 24 30 24 32
In this problem:
n=20
µ0 = 24
∑x = 430
x̄ = 430 / 20 = 21.5
∑x² = 9984
s² = (9984 − 430²/20) / 19 = 38.895
s = 6.237
SE(x̄) = s / √n = 6.237 / √20 = 1.395
The null hypothesis is that the mean weight for the whole stock is 24 mg (µ0 = 24).

$$t = \frac{21.5 - 24}{1.395} = -1.79$$

The df are (20 – 1) = 19.
We then consult the Student’s t-Table. We find that the value of t at p = 0.05 and df = 19 is
2.093.
Since 1.79 < 2.093 (ignoring the sign of t), p > 0.05.
There is not sufficient evidence to suggest that the mean uterine weight of the stock is
different from 24 mg. The 95% confidence interval for the true mean is:
x̄ ± t(0.05, 19) × SE(x̄)
21.5 ± 2.093 × 1.395 = 18.6 to 24.4
The inclusion of the value 24 in this interval corresponds to the non-significant result of
testing this value at the 5% level.
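The uterine-weight example can be recomputed from the raw data as a check on the arithmetic. This is a minimal sketch using only the Python standard library; the critical value 2.093 is hard-coded from the text (p = 0.05, df = 19) rather than computed, and the standard error is taken as s / √n.

```python
import math
import statistics

weights = [9, 18, 21, 26, 14, 18, 22, 27, 16, 30,
           15, 19, 22, 29, 15, 19, 24, 30, 24, 32]
mu0 = 24            # hypothesized mean weight (mg)
t_critical = 2.093  # from the Student's t table: p = 0.05, df = 19

n = len(weights)
mean = statistics.mean(weights)
s = statistics.stdev(weights)   # sample SD, (n - 1) in the denominator
se = s / math.sqrt(n)           # standard error of the mean
t = (mean - mu0) / se

lower = mean - t_critical * se
upper = mean + t_critical * se

print(f"n = {n}, mean = {mean:.1f}, s = {s:.3f}, SE = {se:.3f}")
print(f"t = {t:.2f} with {n - 1} df; critical value = {t_critical}")
print(f"95% CI = {lower:.1f} to {upper:.1f}")
print("Reject H0" if abs(t) > t_critical else "Do not reject H0")
```

Comparing the absolute value of t with the tabulated value gives the same decision as checking whether 24 lies inside the confidence interval.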
Caution on the use of Statistical Test

• ‘Not significant’ does not necessarily mean ‘no association’.
• Statistical tests do not take into account biases in the data.
• Statistical significance does not imply a cause-and-effect relationship.
• Statistical significance does not imply public health significance. It says nothing about
the magnitude of an association.
Activity: Small Group Work
Instructions
You will work in small groups to calculate significance levels and confidence limits. Record
your answers neatly on paper. One group will present their answers, and the other groups
will contribute to the discussion, drawing on their experiences.
Refer to Worksheet 6.1: Calculating Significance Level and Confidence Limits
Key Points
• Sample size is important in the interpretation of the p-value.
• If the sample size is too small, the sampling error will be large. This will prevent us
from finding evidence against H0 and result in high p-values, even if H0 is not true.
Evaluation
• What do p<0.05 and p>0.05 mean?
• What is the distinction between statistical significance and practical significance?
• What is the use of the Student’s t-test?
References
• Bonita, R., Beaglehole, R., Kjellstrom, T. (2006). Basic Epidemiology. (2nd ed). Geneva,
Switzerland: WHO
• Jones, et al. (2008). Biostatistics Workbook. Atlanta, USA: CDC Field epidemiology and
lab training program (FELTP).
• Makwaya, et al. (1997). Lecture notes in Biostatistics. Department of Epidemiology and
Biostatistics. MUHAS.
• McCusker, J. (2001). Epidemiology in Community Health (Rural Health Series, No. 9).
Nairobi, Kenya: AMREF.
• Rosner, B. (2006). Fundamentals of Biostatistics. (6th ed.). Belmont, CA: Thomson
Brooks/Cole.
• Varkevisser, et al. (1995). Designing and Conducting Health Systems Research Projects.
(Vol. 2, Part 2, Module 24). Health Systems Research Training Series.
Handout 6.1: Student’s t-Test Table
Table of t
Degrees of freedom   p = 0.50   0.20   0.10   0.05   0.02   0.01
1 1.00 3.08 6.31 12.71 31.82 63.66
2 0.82 1.89 2.92 4.30 6.96 9.92
3 0.76 1.64 2.35 3.18 4.54 5.84
4 0.74 1.53 2.13 2.78 3.75 4.60
5 0.73 1.48 2.02 2.57 3.36 4.03
6 0.72 1.44 1.94 2.45 3.14 3.71
7 0.71 1.41 1.89 2.36 3.00 3.50
8 0.71 1.40 1.86 2.31 2.90 3.36
9 0.70 1.38 1.83 2.26 2.82 3.25
10 0.70 1.37 1.81 2.23 2.76 3.17
11 0.70 1.36 1.80 2.20 2.72 3.11
12 0.70 1.36 1.78 2.18 2.68 3.05
13 0.69 1.35 1.77 2.16 2.65 3.01
14 0.69 1.35 1.76 2.14 2.62 2.98
15 0.69 1.34 1.75 2.13 2.60 2.95
16 0.69 1.34 1.75 2.12 2.58 2.92
17 0.69 1.33 1.74 2.11 2.57 2.90
18 0.69 1.33 1.73 2.10 2.55 2.88
19 0.69 1.33 1.73 2.09 2.54 2.86
20 0.69 1.33 1.72 2.09 2.53 2.85
21 0.69 1.32 1.72 2.08 2.52 2.83
22 0.69 1.32 1.72 2.07 2.51 2.82
23 0.69 1.32 1.71 2.07 2.50 2.81
24 0.68 1.32 1.71 2.06 2.49 2.80
25 0.68 1.32 1.71 2.06 2.49 2.79
For more than 25 degrees of freedom, calculate the value of t that is required for each level of
significance from the expression: a+ (b/degree of freedom), where a and b have the values set
out below. As examples of the calculation the values of t needed for significance are given
for 30 and 40 degrees of freedom. Thus for P=0.05 the value of t needed is 1.96 +
(2.50/30)=2.04 for 30 degrees of freedom and 1.96 + (2.50/40)=2.02 for 40 degrees of
freedom.
Degrees of freedom   P = 0.50   0.20   0.10   0.05   0.02   0.01
a 0.67 1.28 1.65 1.96 2.33 2.58
b 0.26 0.86 1.58 2.50 3.98 5.30
30 0.68 1.31 1.70 2.04 2.46 2.76
40 0.68 1.30 1.69 2.02 2.43 2.71
Source: Makwaya et al. (1997). Lecture notes in Biostatistics. MUCHS. Tanzania.
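The rule for more than 25 degrees of freedom, t = a + (b / df), is easy to put into a small lookup. The sketch below copies the a and b values from the handout; the function name `approx_t_critical` is our own.

```python
# Constants a and b from Handout 6.1, keyed by significance level P
A = {0.50: 0.67, 0.20: 1.28, 0.10: 1.65, 0.05: 1.96, 0.02: 2.33, 0.01: 2.58}
B = {0.50: 0.26, 0.20: 0.86, 0.10: 1.58, 0.05: 2.50, 0.02: 3.98, 0.01: 5.30}

def approx_t_critical(p, df):
    """Approximate critical t for df > 25 using t = a + b / df."""
    if df <= 25:
        raise ValueError("Use the tabulated values for 25 or fewer degrees of freedom")
    return A[p] + B[p] / df

for df in (30, 40):
    row = "  ".join(f"{approx_t_critical(p, df):.2f}" for p in A)
    print(f"df = {df}: {row}")
```

As df grows, b / df shrinks towards zero and the critical value approaches a, the corresponding value of the standard normal distribution.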
Worksheet 6.1: Calculating Significance Level and Confidence Limits
Instructions:
Work together in your small group to answer the following questions on a sheet of paper.
Select a recorder in your group, and record your answer and calculations.
Question:
The mean level of prothrombin in the normal population is known to be 20.0 mg/100 ml of
plasma and the standard deviation is 4 mg/100 ml. A sample of 40 patients showing vitamin
K deficiency has a mean prothrombin level of 18.5 mg/100 ml.
a) How reasonable is it to conclude that the true mean for patients with vitamin K deficiency
is the same as that for a normal population?
b) Within what limits would the mean prothrombin level be expected to lie for all patients
with vitamin K deficiency? Give the 95% confidence limits.
Answer:
Session 7: Chi-Square (χ²) Test
Learning Objectives
• Define chi-square test
• Describe chi-square test of 2-by-2 table
• Demonstrate calculation of chi-square from a 2-by-2 table
• Describe chi-square test for a larger contingency table
Introduction to Chi-Square (χ2) Test

• χ is the Greek letter chi (pronounced ‘kye’).
• A χ² test is used to determine whether a set of frequencies follows a particular
distribution (e.g. binomial, normal, Poisson).
• In its basic form, it tests whether the observed frequencies of individuals with some
characteristic are significantly different from those expected under some hypothesis.
Definition
• Chi-square: A test used to find out whether observed differences between proportions of
events in groups may be considered statistically significant.
The Chi-square Test of the 2-by-2 Table

• Consider the previous examples comparing two proportions from previous lectures.
• The results of the clinical trial in which the proportions of patients dying who received
either treatment A or B were compared, can be presented in a 2 x 2 table as follows:
Figure 1: Survival Outcomes by Treatment Used

                 Outcome
Treatment   Died   Survived   Total
A           41     216        257
B           64     180        244
Total       105    396        501
• This table is called a ‘2-by- 2 contingency table’ because there are 2 rows and 2 columns.
o In general, we can have any ‘r × c’ contingency table, with ‘r’ rows and ‘c’ columns.
• From the above table, the observed frequencies are 41, 216, 64 and 180. We need to
obtain the expected frequencies under the null hypothesis that: ‘There is no difference in
outcome for patients receiving Treatment A and Treatment B.’
o In contingency table problems, the expected frequencies are the frequencies that you
would predict (‘expect’) in each cell of the table, if you were only provided with the
row and column totals (assuming that the variables under comparison were
independent).
o The expected frequencies are calculated in the following way, where E = expected
frequency:
$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
• For example, in the top left cell, where we observe 41 deaths, the expected frequency
under the null hypothesis is:
$$\frac{105 \times 257}{501} = 53.86$$
• These expected frequencies are shown in the table below.
• They add up to the same grand total as the observed frequencies.
• We can then compare the observed and the expected frequencies by looking at their
differences.
Figure 2: Steps to Obtain Chi-square Value from a 2-by-2 Table

Observed (O)   Expected (E)   O − E     (O − E)²/E
41             53.86          −12.86    3.07
216            203.14         12.86     0.81
64             51.14          12.86     3.23
180            192.86         −12.86    0.86
Total 501      501.00         0.00      7.97
• We need also to consider the importance of the magnitude of the differences relative to
the expected values.
o For example, a difference of 5 between 995 and 1000 is not as important as the
discrepancy of 5 between 2 and 7.
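The expected frequencies and the (O − E)²/E column for the treatment example can be reproduced with a few lines of code. This is an illustrative sketch only; the variable names are ours.

```python
# Observed frequencies from the 2-by-2 table of survival outcomes
observed = [[41, 216],   # Treatment A: died, survived
            [64, 180]]   # Treatment B: died, survived

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected frequency: row total x column total / grand total
        e = row_totals[i] * col_totals[j] / grand_total
        contribution = (o - e) ** 2 / e
        chi_square += contribution
        print(f"O = {o:3d}  E = {e:6.2f}  O - E = {o - e:6.2f}  (O - E)^2/E = {contribution:.2f}")

print(f"chi-square = {chi_square:.2f}")
```

Summing the unrounded terms gives about 7.98; the 7.97 in the table comes from adding terms that were first rounded to two decimal places.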
Obtaining a Chi-square Value

• The chi-square value is obtained by applying the following formula to each of the four
cells in the contingency table and then summing the results. The general formula for χ² is
(where O = observed value, and E = expected value):

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
• The chi-square values depend on the degrees of freedom.

o If a contingency table has ‘r’ rows and ‘c’ columns then:
df = (r–1)(c–1)
Refer to Handout 7.1: Table of Chi-Square (χ2) Test Values.
• The percentage points of the chi-square distribution are provided in Handout 7.1.
• From our example, we can determine that:
df = (2 − 1)(2 − 1) = 1. Therefore, from the above table, χ² = 7.97 with 1 degree of
freedom (df).
• The Chi-Square table in Handout 7.1 shows that the observed value of 7.97 is beyond the
0.01 point of the Chi-square distribution (6.63 for 1 df).
• Therefore p < 0.01. We can conclude that the difference between the two treatments is
significant.
• A short cut formula for computing χ2 for a 2-by-2 table is given as follows:
Figure 3: Obtaining Chi-square Value Using the Degree of Freedom

                        Variable x
Variable y              X1            X2            Row marginal total
Y1                      a             b             r1 = a + b
Y2                      c             d             r2 = c + d
Column marginal total   s1 = a + c    s2 = b + d    n = a + b + c + d
Using the above information in the cells, chi-square is calculated by:

$$\chi^2 = \frac{(ad - bc)^2 \times n}{r_1 r_2 s_1 s_2}$$
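The short-cut formula translates directly into a small function. The sketch below applies it to the treatment example from earlier in this session (a = 41, b = 216, c = 64, d = 180); the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2-by-2 table using the short-cut formula."""
    n = a + b + c + d
    r1, r2 = a + b, c + d   # row marginal totals
    s1, s2 = a + c, b + d   # column marginal totals
    return (a * d - b * c) ** 2 * n / (r1 * r2 * s1 * s2)

value = chi_square_2x2(41, 216, 64, 180)
print(f"chi-square = {value:.2f}")
```

The result agrees with the cell-by-cell calculation apart from rounding, and it avoids computing the expected frequencies explicitly.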
Activity: Small Group Activity
Instructions
Refer to:
• Handout 7.1: Table of Chi-Square (χ2) Test Values
• Handout 7.2: Steps in Calculating Chi-Square (χ2) Test
• Worksheet 7.1: Calculate the Chi-square Test
You will work in small groups to calculate Chi-square values. Read the question in
Worksheet 7.1, and refer to Handout 7.1 and Handout 7.2 for guidance as needed. One group
will present their solution in plenary, and others share in the discussion.
Chi-square Test for Larger Contingency Tables
Instructions
Tutor will provide an example of the Chi-square Test for larger contingency tables.
The following data show a sample of 10-year-old children classified according to the state of
oral hygiene and type of school attended.
Oral Hygiene among 10-year-olds, by type of school

                         Oral Hygiene
Type of School   Good   Fair+   Fair-   Poor   Total
Below average    62     103     57      11     233
Average          50     36      26      7      119
Above average    80     69      18      2      169
Total            192    208     101     20     521
ASK students: ‘What is the null hypothesis?’

Answer: H0 is that there is no association between oral hygiene classification and type of
school attended. (That is, the proportions of children attending below average, average and
above average schools are the same in children with good, fair+, fair−, or poor oral hygiene.)
The expected number of children attending below average schools among the 192 children
with good oral hygiene is:

$$\frac{233 \times 192}{521} = 85.9$$

Similarly, the expected number of children attending below average schools among the 208
children with fair+ oral hygiene is:

$$\frac{233 \times 208}{521} = 93.0$$
Now, let’s look at the expected frequencies for oral hygiene by type of school.
Expected Frequencies for Oral Hygiene among 10-year-olds, by type of school

                         Oral hygiene
Type of school   Good   Fair+   Fair-   Poor   Total
Below average    85.9   93.0    45.2    8.9    233
Average          43.8   47.5    23.1    4.6    119
Above average    62.3   67.5    32.7    6.5    169
Total            192    208     101     20     521
Recall the general formula for χ²:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Now, apply that to our data:

$$\chi^2 = \frac{(62 - 85.9)^2}{85.9} + \frac{(103 - 93.0)^2}{93.0} + \dots + \frac{(2 - 6.5)^2}{6.5}$$

$$\chi^2 = 6.65 + 1.08 + \dots + 3.12 = 31.4$$
df = (3-1) (4-1) = (2)(3) = 6
Therefore, p <0.001.
Thus, we reject H0 and conclude that it is probable that there is an association between oral
hygiene and type of school attended.
The observed proportions of children with good oral hygiene are as follows:

Type of School   Proportion with good oral hygiene
Below average    62/233 = 0.27
Average          50/119 = 0.42
Above average    80/169 = 0.47
Note that a larger proportion of the children attending above average schools had good oral
hygiene (0.47) than of those attending below average schools (0.27).
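The same calculation extends to any table with r rows and c columns. The function below is a minimal sketch (its name and structure are ours, not the manual's) that returns the chi-square value and the degrees of freedom for the oral hygiene data.

```python
def chi_square_test(table):
    """Return (chi_square, df) for an r-by-c table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_square += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df

# Rows: below average, average, above average schools
# Columns: good, fair+, fair-, poor oral hygiene
oral_hygiene = [
    [62, 103, 57, 11],
    [50, 36, 26, 7],
    [80, 69, 18, 2],
]

chi2, df = chi_square_test(oral_hygiene)
print(f"chi-square = {chi2:.1f} with {df} degrees of freedom")
```

The value can then be compared with the row for 6 degrees of freedom in Handout 7.1, where 16.81 is required at P = 0.01.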
Key Points
• The Chi-square test is only valid for comparing observed and expected frequencies. It is
not valid for other variables such as percentages, means and rates.
• The Chi-square test is not valid for cells with expected frequencies less than 5.
Evaluation
• Define ‘chi-square test’.
• What is the formula for calculating the chi-square test for a 2-by-2 table?
• What is the formula for calculating the Chi-square test for a larger contingency table?
References
• Makwaya, et al. (1997). Lecture notes in biostatistics. Department of Epidemiology and
Biostatistics. MUCHS.
• Varkevisser, et al. (1995). Designing and Conducting Health Systems Research Projects.
(Vol. 2, Part 2, Module 24). Health Systems Research Training Series.
Handout 7.1: Table of Chi-Square (χ2) Test Values
Table of Chi-square (χ²) Test

Degrees of freedom (df)   P = 0.50   P = 0.20   P = 0.10   P = 0.05   P = 0.02   P = 0.01
1 0.45 1.64 2.71 3.84 5.41 6.63
2 1.39 3.22 4.61 5.99 7.82 9.21
3 2.37 4.64 6.25 7.81 9.84 11.34
4 3.36 5.99 7.78 9.49 11.67 13.28
5 4.35 7.29 9.24 11.07 13.39 15.09
6 5.35 8.56 10.64 12.59 15.03 16.81
7 6.35 9.80 12.02 14.07 16.62 18.48
8 7.34 11.03 13.36 15.51 18.17 20.09
9 8.34 12.24 14.68 16.92 19.68 21.67
10 9.34 13.44 15.99 18.31 21.16 23.21
11 10.34 14.63 17.28 19.68 22.62 24.72
12 11.34 15.81 18.55 21.03 24.05 26.22
13 12.34 16.98 19.81 22.36 25.47 27.69
14 13.34 18.15 21.06 23.68 26.87 29.14
15 14.34 19.31 22.31 25.00 28.26 30.58
16 15.34 20.47 23.54 26.30 29.63 32.00
17 16.34 21.61 24.77 27.59 31.00 33.41
18 17.34 22.76 25.99 28.87 32.35 34.81
19 18.34 23.90 27.20 30.14 33.69 36.19
20 19.34 25.04 28.41 31.41 35.02 37.57
21 20.34 26.17 29.62 32.67 36.34 38.93
22 21.34 27.30 30.81 33.92 37.66 40.29
23 22.34 28.43 32.01 35.17 38.97 41.64
24 23.34 29.55 33.20 36.42 40.27 42.98
25 24.34 30.68 34.38 37.65 41.57 44.31
26 25.34 31.79 35.56 38.89 42.86 45.64
27 26.34 32.91 36.74 40.11 44.14 46.96
28 27.34 34.03 37.92 41.34 45.42 48.28
29 28.34 35.14 39.09 42.56 46.69 49.59
30 29.34 36.25 40.26 43.77 47.96 50.89
If the number of degrees of freedom is greater than 30, calculate the value of the expression:

$$\sqrt{\chi^2} - \sqrt{df}$$

The level P associated with this value is (to a close approximation) as follows:
P =     0.50   0.20   0.10   0.05   0.02   0.01
Value   0.00   0.60   0.91   1.16   1.45   1.64
Example:
• Assume that χ² = 49 and df = 36.
• The expression √χ² − √df = 7 − 6 = 1.0
• This is not quite significant at P = 0.05 (where a value of 1.16 is required).
o Note: If χ² = 55, the expression would give 7.42 − 6.00 = 1.42, which is significant at
P = 0.05 and almost significant at P = 0.02 (where a value of 1.45 is required).
Source: Makwaya et al. (1997). Lecture notes in Biostatistics. MUCHS. Tanzania.
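The rule for more than 30 degrees of freedom can be sketched as follows. The expression is taken to be √χ² − √df, as the handout's worked example (7 − 6 = 1.0) implies; the threshold values are copied from the handout, and the function names are hypothetical.

```python
import math

# (P level, required value of sqrt(chi2) - sqrt(df)), strictest level first
THRESHOLDS = [(0.01, 1.64), (0.02, 1.45), (0.05, 1.16),
              (0.10, 0.91), (0.20, 0.60), (0.50, 0.00)]

def large_df_statistic(chi2, df):
    """Value of sqrt(chi2) - sqrt(df), for use when df > 30."""
    return math.sqrt(chi2) - math.sqrt(df)

def smallest_level_reached(chi2, df):
    """Smallest tabulated P level whose required value is reached, or None."""
    value = large_df_statistic(chi2, df)
    for level, required in THRESHOLDS:
        if value >= required:
            return level
    return None

for chi2 in (49, 55):
    value = large_df_statistic(chi2, 36)
    level = smallest_level_reached(chi2, 36)
    print(f"chi2 = {chi2}, df = 36: value = {value:.2f}, significant at P = {level}")
```

With χ² = 49 the value reaches only the P = 0.10 threshold, while χ² = 55 passes the P = 0.05 threshold but falls just short of P = 0.02.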
Handout 7.2: Steps in Calculating Chi-Square (χ2) Test
Chi-Square (χ2) Test

If you have categorical data, the chi-square test is used to find out whether observed
differences between proportions of events in two or more groups may be considered
statistically significant.
Example
Suppose that in a cross-sectional study of the factors affecting the utilization of antenatal clinics
you found that 64% of the women who lived within 10 kilometres of the clinic came for
antenatal care, compared to only 47% of those who lived more than 10 kilometres away. This
suggests that antenatal care (ANC) is used more often by women who live close to the clinics.
The complete results are presented in the table below:
Utilisation of Antenatal Clinics by Women Living Far From and Near the Clinic
Distance from ANC Used ANC Did not use ANC TOTAL
Less than 10 km 51 (64%) 29 (36%) 80 (100%)
10 km or more 35 (47%) 40 (53%) 75 (100%)
Total 86 69 155
From the table we conclude that there seems to be a difference in the use of antenatal care
between those who live close to and those who live far from the clinic (64% versus 47%). We
now want to know if this observed difference is statistically significant or not.
The chi-square test can be used to give us the answer. This test is based on measuring the
difference between the observed frequencies and the expected frequencies if the null
hypothesis (i.e. the hypothesis of no difference) were true.
To perform a χ2 test you need to complete the following 3 steps:

1. Calculate the χ2 value
2. Use a χ2 table
3. Interpret the χ2
Step 1: Calculate the χ2 value

a) Calculate the expected frequency (E) for each cell.
o To find the expected frequency (E) of a cell you multiply the row total by the
column total and divide by the grand (overall) total:
Row Total × Column Total
E=
Grand Total
b) For each cell, subtract the expected frequency from the observed frequency (O - E).
c) For each cell, square the result of (O - E) and divide by the expected frequency E.
d) Add the squared results (calculated in step c) for all the cells.
e) The formula for calculating a chi-square value (steps b to d) is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

o O is the observed frequency (indicated in the table)
o E is the expected frequency (to be calculated)
o ∑ means ‘sum of’ and directs you to add together the values of [(O − E)²/E] for all the
cells of the table.
o For a 2-by-2 table (which contains 4 cells) the formula is:

$$\chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \frac{(O_4 - E_4)^2}{E_4}$$
Step 2: Using a χ2 Table
As for the t-test, the calculated χ2 value has to be compared with a theoretical χ2 value in order
to determine whether the null hypothesis is rejected or not.
Note: Handout 7.1: Table of Chi-Square (χ²) Test Values contains a table of theoretical χ²
values.
a) First, decide what significance level you want to use (alpha or α-value). We usually take
0.05.
b) Then, calculate the degrees of freedom. With the χ2 test the number of degrees of freedom
is related to the number of cells (i.e., the number of groups you are comparing).
o The number of degrees of freedom is found by multiplying the number of rows (r)
minus 1 by the number of columns (c) minus 1:
df = (r–1)(c–1)
o For a simple two-by-two table the number of degrees of freedom is 1:
df = (2–1)(2–1) = 1
c) Locate in the table the χ² value that corresponds to the chosen α-value and the number
of df. If the calculated χ² value is equal to or larger than the χ² value from the table,
then the p-value is equal to or smaller than the chosen level of significance (α-value). In
this case, we reject the null hypothesis and conclude that there is a statistically
significant difference between the groups.
d) If the calculated χ² value is smaller than the χ² value from the table, then the p-value
is larger than the chosen significance level of 0.05. In this case, we do not reject the null
hypothesis and conclude that the observed difference is not statistically significant.
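Step 2 amounts to a table lookup followed by a comparison. The sketch below hard-codes the P = 0.05 column of Handout 7.1 for 1 to 6 degrees of freedom; the function name and the example χ² values are made up for illustration.

```python
# Critical chi-square values at alpha = 0.05, copied from Handout 7.1
CRITICAL_AT_005 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07, 6: 12.59}

def decide(chi_square, rows, cols):
    """Return (df, critical value, reject_h0) at the 5% significance level."""
    df = (rows - 1) * (cols - 1)
    critical = CRITICAL_AT_005[df]
    reject_h0 = chi_square >= critical
    return df, critical, reject_h0

# Hypothetical calculated values for a 2-by-2 table
for chi_square in (2.10, 5.20):
    df, critical, reject_h0 = decide(chi_square, rows=2, cols=2)
    verdict = "reject H0" if reject_h0 else "do not reject H0"
    print(f"chi-square = {chi_square:.2f}, df = {df}, critical = {critical}: {verdict}")
```

A different significance level would use a different column of the handout; only the dictionary of critical values needs to change.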
Worksheet 7.1: Calculate the Chi-square Test
Instructions
• Work in small groups to complete the following worksheet.
• Refer to Handout 7.1: Table of Chi-square (χ2 ) Test and Handout 7.2: Steps in
Calculating Chi-Square (χ2) Test as needed.
Question
A study of the factors affecting the utilization of antenatal clinics found that 64% of the
women who lived within 10 km of the clinic came for antenatal care, compared to only 47% of
those who lived more than 10 km away. This suggests that antenatal care is used more often
by women who live close to the clinics. The complete results are presented below:
Utilization of Antenatal Clinic by Women Living Far From and Near the Clinics
Distance from ANC   Used ANC   Did not use ANC   Total
Less than 10 km     51 (64%)   29 (36%)          80 (100%)
10 km or more       35 (47%)   40 (53%)          75 (100%)
Total               86         69                155
From the table we determine that there seems to be a difference in utilization of antenatal care
between those who live close to and those who live far from the clinic. We want to know
whether this observable difference is statistically significant.
Please calculate the χ2 value. Use Handout 7.1: Table of Chi-square (χ2 ) Test to interpret
the results.
Answer:
Session 8: Source and Uses of Morbidity and Mortality Statistics
Learning Objectives
• Define vital statistics and demography
• Describe key sources of demographic data
• State the definition of rate, ratio and proportion
• Describe the measures of fertility, morbidity and mortality rates
Introduction to Vital Statistics and Demography

Definitions
• Demography: The study of the structure of human populations using statistics relating to
births, deaths, wealth, disease, etc.
• Vital statistics: Quantitative data concerning the population, such as the number of
births, marriages, and deaths.
• Census: A systematic, routine way of counting subjects in a defined boundary or limits of
land. A census produces reports of individuals, population size and structure at a point in
time.
Sources of Data
• Quality of data depends on many factors, one of which is the source of data.
• Sources of data have a direct implication on information quality in terms of coverage,
completeness and cost.
• In this session we will concentrate on the following sources of demographic data:
o Census
o Vital registration systems
o Sample surveys
Census
• The main characteristic of census is that it covers the whole population.
• Although commonly limited to population, a census can be used to quantify any number
of items in a category.
o For example, recorded censuses have been found of agriculture, business, livestock,
housing, etc., sometimes done concurrently with population census.
• No sampling is involved and each person is enumerated separately.
• A census must have a legal basis to make it complete and compulsory.
• A census reflects a single point in time (such as 1-January-2010), although the whole
process of data collection/enumeration can take a longer time.
• Basic questions which should appear on a census questionnaire include:
o Name, age, sex, relationship to the head of household, marital status, race, religion,
ethnicity, education level, occupation, employment status, migration status, and
household amenities.
• Additional questions would depend on the availability and quality of vital registration.
• A population census can be carried out using either de facto or de jure method.
o De Facto Census
De facto enumeration assigns persons to the area or location where they are
found during enumeration (i.e., it enumerates the population ‘in fact’ there at the
time of the census, regardless of the location of their legal or permanent/normal
residence). A person’s place of origin is not taken into account.
For example, in Tanzania’s 1988 Population Census, Zanzibar had a population of
641,000. This means that 641,000 people spent the census night in Zanzibar.
Tanzania follows de facto census enumeration.
o De Jure Census
The de jure method of enumeration allocates persons to their normal/usual
residence. That is, the census counts people who belong to an area or have the
right to live there through citizenship, legal residence, etc. For example, a
businessman working in Dar es Salaam but living in Arusha would be assigned to
Arusha.
• In Tanzania, a census is normally conducted every ten years (decennial).
o This creates some setbacks and implications for planning, because population can
change rapidly as a result of births, deaths, and migration/movement.
• To overcome this problem, inter-censal surveys or mini-surveys are conducted.
One example is the Tanzania Demographic and Health Survey (TDHS), conducted
approximately every 5 years.
• Further surveys on morbidity and for specific diseases (e.g., maternal mortality,
HIV/AIDS, childhood malnutrition, etc.) can be conducted whenever a need arises.
Vital Registration
• Vital registration systems are common in developed countries, where information on
births, marriages, deaths and migrations is collected. In developing countries, vital
registration systems are often incomplete, unreliable, or non-existent.
• Questions in the vital registration system are always very simple and few.
o Consider hospital or health service data here in Tanzania. Examples of such
registrations are information on deaths found in hospitals (death certificates), birth
and marriage data found in churches, mosques and District Commissioners’ offices
and migration data found at airports and borders.
• The shortfall of vital registration systems is that they are often incomplete, cover only a
selective sample of events, and are therefore unreliable. This does not mean that the
system should be discarded; instead, it should be improved to remove these errors.
Sample Surveys
• Sample surveys give the same information in a more detailed form when a reliable vital
registration system does not exist.
• Only a sample of the population is involved; thus, sample surveys are less costly than a
complete census. In addition, information can usually be collected more quickly in a
sample survey than in a census.
• Sample surveys allow more detailed, nuanced data collection than census systems.
• One key disadvantage to sample surveys is the error introduced through sampling.
Ratio, Proportion and Rate

• Ratio, Proportion, and Rate are concepts that are critical to epidemiology and vital
statistics. It is important to understand the differences between them.
Ratio
• Any number (numerator) divided by any other number (denominator) gives a ratio.
• For example, $\frac{X}{Y}$ is a ratio, where X is the numerator and Y is the denominator.
X and Y do not need to have the same units.
• Sex Ratio at Birth is a commonly used ratio in epidemiology and vital statistics.
$$\text{Sex Ratio at Birth} = \frac{\text{No. of male births}}{\text{No. of female births}}$$
Proportion
• A proportion is a special form of ratio in which the numerator is part of the
denominator.
• For example:
o Proportion of females among first-year MUCHS students
$$\frac{\text{No. of females in 1st year}}{\text{Total no. of 1st-year students}}$$
o Proportion of male births
$$\frac{\text{No. of male births}}{\text{Total no. of births}}$$
• A proportion is often expressed as a percentage.
Rate
• A rate is a proportion with the added dimension of time.
• A population must be studied throughout a specified time period (e.g., 1 year), during
which the frequency of an event of interest (e.g., disease, death, etc.) is counted.
• A rate indicates the frequency of events occurring in a population per unit of time.
• For example:
o The death rate per year is given by the number of deaths during the year, divided by
the number of person-years of exposure to the risk of death.
$$\text{Crude Death Rate} = \frac{\text{No. of deaths in one year}}{\text{Total population}} \times 1{,}000$$
• Rates may be expressed per 1,000; per 100,000; or per 1,000,000 population depending
on convention and convenience.
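The difference between the three measures can be sketched in a few lines of Python. This is only an illustration: all counts below are hypothetical and do not come from any census or survey.

```python
# Hypothetical counts for one district in one year (illustrative only).
male_births = 5_150
female_births = 4_850
deaths = 1_200
total_population = 100_000

# Ratio: numerator and denominator are separate groups.
sex_ratio_at_birth = male_births / female_births

# Proportion: the numerator is part of the denominator.
proportion_male = male_births / (male_births + female_births)

# Rate: events counted over a time period, expressed here per 1,000 population.
crude_death_rate = deaths / total_population * 1_000

print(f"Sex ratio at birth: {sex_ratio_at_birth:.3f}")
print(f"Proportion of male births: {proportion_male:.1%}")
print(f"Crude death rate: {crude_death_rate:.1f} per 1,000")
```

Note that the proportion can never exceed 1, while the ratio and the rate can.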
Measures of Fertility, Morbidity and Mortality
Measures of Fertility
• Common measures of fertility include:
o Crude Birth Rate (CBR)
o General Fertility Rate (GFR)
o Total Fertility Rate (TFR)
o Gross Reproductive Rate (GRR)
Crude Birth Rate (CBR)
• CBR is called a ‘rate,’ but in practice it is a ratio.
• The rate is ‘crude’ because it does not take into account the risk of giving birth according
to age and sex differences.
• CBR is defined as:

$$\text{Crude Birth Rate} = \frac{\text{No. of live births in one year}}{\text{Total population}} \times 1{,}000$$
Fertility Rate/General Fertility Rate (GFR)

• General Fertility Rate (GFR) is more widely accepted than CBR, and is considered
a more conventional and modern measure of fertility.
• The denominator is restricted to women at risk of child-bearing rather than the general
population.
• It is often known simply as ‘fertility rate’.
• GFR is defined as:

$$\text{General Fertility Rate} = \frac{\text{No. of live births in one year}}{\text{Mid-year population of women aged 15-49}} \times 1{,}000$$
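As a rough illustration of how the choice of denominator changes the result, the sketch below computes CBR and GFR from the same number of births. The helper `per_thousand` and all figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def per_thousand(events: float, population: float) -> float:
    """Return the number of events per 1,000 population."""
    return events / population * 1_000


# Hypothetical figures for one year (not taken from the manual).
live_births = 40_000
total_population = 1_000_000
women_15_49 = 240_000  # mid-year population of women aged 15-49

cbr = per_thousand(live_births, total_population)  # denominator: everyone
gfr = per_thousand(live_births, women_15_49)       # denominator: women at risk

print(f"CBR = {cbr:.1f} per 1,000 population")
print(f"GFR = {gfr:.1f} per 1,000 women aged 15-49")
```

The GFR is always larger than the CBR for the same population, because its denominator is restricted to women of child-bearing age.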
The Total Fertility Rate (TFR)

• TFR is the average number of children a woman would have during her reproductive
lifetime, assuming that current age-specific fertility rates continue to apply throughout
that time.
• The total fertility rate is calculated from Age-Specific Fertility Rates (ASFRs).
• We get the ASFRs when we divide the number of live births by the number of women in
each age interval.
• Unlike the CBR and GFR, the TFR is calculated from age-specific data, so the resulting
figure does not depend on the age distribution of the population.
Instructions
Tutor will provide an example of the steps required to calculate Total Fertility Rate (TFR).
Example
Figure 1: Number of Live Births and Maternal Age, Tanzania, 1988

| Age | No. of Women | No. of Live Births | Age-Specific Fertility Rate (No. of live births / No. of women) |
|---|---|---|---|
| 15-19 | 665,000 | 21,000 | 0.0316 |
| 20-24 | 516,000 | 114,000 | 0.2209 |
| 25-29 | 459,000 | 118,000 | 0.2571 |
| 30-34 | 344,000 | 123,000 | 0.3576 |
| 35-39 | 310,000 | 37,000 | 0.1194 |
| 40-44 | 229,000 | 6,000 | 0.0262 |
| 45-49 | 218,000 | 5,000 | 0.0229 |
| Total | 2,741,000 | 424,000 | 1.0357 |
• Age-specific fertility rates are calculated by dividing the no. of live births by no. of
women in each age cohort.
• For women ages 15-19, ASFR = 21,000 / 665,000 = 0.0316
• The TFR is derived from the sum of all age-specific fertility rates.
• TFR = 1.0357 × 5 = 5.1785
• The sum of all ASFRs is multiplied by 5 because each age group spans 5 years.
• If ages are in single years, then there is no need to multiply this sum.
• The figure 5.1785 means that on average, each woman will have about 5 children during
her reproductive period (assuming that these age-specific fertility rates still apply until
she finishes her reproductive life).
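The TFR calculation above can be checked with a short Python sketch. It uses the counts from Figure 1; the variable names are illustrative, and each ASFR is rounded to four decimal places to match the table (without rounding, the result is about 5.178).

```python
# Figure 1 data: (age group, number of women, number of live births)
figure_1 = [
    ("15-19", 665_000, 21_000),
    ("20-24", 516_000, 114_000),
    ("25-29", 459_000, 118_000),
    ("30-34", 344_000, 123_000),
    ("35-39", 310_000, 37_000),
    ("40-44", 229_000, 6_000),
    ("45-49", 218_000, 5_000),
]
AGE_INTERVAL_YEARS = 5  # each age group spans five years

# ASFR = live births / women in the age group, rounded as in the table.
asfrs = {age: round(births / women, 4) for age, women, births in figure_1}

# TFR = sum of ASFRs multiplied by the width of the age interval.
tfr = sum(asfrs.values()) * AGE_INTERVAL_YEARS

for age, asfr in asfrs.items():
    print(f"{age}: ASFR = {asfr:.4f}")
print(f"TFR = {tfr:.4f}")
```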
Gross Reproductive Rate (GRR)

• GRR is the average number of daughters a woman would have if she survived to at least
age 50 and experienced the given female ASFRs.
• GRR is similar to the TFR, except that it considers female live births rather than all
births. This implies that the ASFR for GRR is based on female births only.
• A GRR of 1 means that women are able to replace themselves, while a GRR of 2 means
that the population is doubling itself: each woman is on average producing two daughters.
Like the TFR, GRR is a hypothetical measure.
• It is a period measure which does not take into account the effect of female mortality
either before age 15 or 15 to 50 years.
Instructions
Tutor will use the example below to show the steps required to calculate Gross Reproductive
Rate (GRR).
Example
Figure 2: Number of Female Live Births and Maternal Age, Tanzania, 1988

| Age | No. of Women | No. of Live Births | No. of Female Births | Female ASFR |
|---|---|---|---|---|
| 15-19 | 665,000 | 21,000 | 11,000 | 0.0165 |
| 20-24 | 516,000 | 114,000 | 58,000 | 0.1124 |
| 25-29 | 459,000 | 118,000 | 60,000 | 0.1307 |
| 30-34 | 344,000 | 123,000 | 63,000 | 0.1831 |
| 35-39 | 310,000 | 37,000 | 19,000 | 0.0613 |
| 40-44 | 229,000 | 6,000 | 3,000 | 0.0131 |
| 45-49 | 218,000 | 5,000 | 3,000 | 0.0138 |
| Total | 2,741,000 | 424,000 | 217,000 | 0.5309 |
• Female ASFRs are calculated by dividing the no. of live female births by no. of women in
each age cohort.
• For women ages 15-19, Female ASFR = 11,000 / 665,000 = 0.0165
• We calculate GRR as follows:
GRR = 0.5309 × 5 = 2.6545.
• Note: If the proportion of births that are female is known, the GRR can be calculated
from the TFR.
• Recall that in the TFR example (Figure 1), we found that the TFR was 5.1785.
• Calculate GRR using the TFR and the proportion of female births, as follows:
$$\text{GRR} = 5.1785 \times \frac{217{,}000}{424{,}000} \approx 2.65$$
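Both routes to the GRR can be compared in a short Python sketch. It uses the counts from Figure 2 and the TFR of 5.1785 from the earlier example; variable names are illustrative, and female ASFRs are rounded to four decimals to match the table.

```python
# Figure 2 data: (age group, number of women, number of female live births)
figure_2 = [
    ("15-19", 665_000, 11_000),
    ("20-24", 516_000, 58_000),
    ("25-29", 459_000, 60_000),
    ("30-34", 344_000, 63_000),
    ("35-39", 310_000, 19_000),
    ("40-44", 229_000, 3_000),
    ("45-49", 218_000, 3_000),
]
AGE_INTERVAL_YEARS = 5

# Method 1: sum the female ASFRs and multiply by the age interval.
female_asfrs = [round(girls / women, 4) for _, women, girls in figure_2]
grr_direct = sum(female_asfrs) * AGE_INTERVAL_YEARS

# Method 2: scale the TFR by the proportion of births that are female.
tfr = 5.1785            # from the TFR example
total_births = 424_000  # total live births in Figure 2
total_female_births = sum(girls for _, _, girls in figure_2)
grr_from_tfr = tfr * total_female_births / total_births

print(f"GRR (from female ASFRs) = {grr_direct:.4f}")
print(f"GRR (from TFR)          = {grr_from_tfr:.4f}")
```

The two results differ slightly in the third decimal place because the second method applies one overall proportion of female births to every age group.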
Measures of Morbidity
Incidence Rates
• Incidence is a measure of the risk of developing a disease/condition within a specified
time period.
• The incidence rate is the number of new cases per population in a given time period (i.e.,
the rate of contracting a disease among those still at risk).
• Incidence rate is expressed as follows, where k = 2, 3, 4, 5 or 6 depending on the
convenience or convention:
$$\text{Incidence Rate} = \frac{\text{No. of new cases of disease in a period of time}}{\text{Total midyear population at risk of acquiring the disease}} \times 10^{k}$$
Prevalence Rates
• The prevalence of a disease is the total number of existing cases among the entire
population.
• It can be measured at a single instant (point prevalence) or over a stretch of time
(period prevalence).
• Point prevalence is expressed as follows, where k = 2, 3, 4, 5 or 6 depending on the
convenience or convention:
$$\text{Point Prevalence Rate} = \frac{\text{No. of subjects with the disease at time } t}{\text{Total population at time } t} \times 10^{k}$$
• Period prevalence is expressed as follows, where k = 2, 3, 4, 5, or 6 depending on the
convenience or convention:
$$\text{Period Prevalence Rate} = \frac{\text{Total no. of persons with disease during a period}}{\text{Total population at mid-point of the interval}} \times 10^{k}$$
• This index is prone to bias because cases with long duration have a higher probability of
being in the sample than those with short duration.
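To see how incidence and the two forms of prevalence differ in practice, the following Python sketch applies the formulas above to one hypothetical community. The helper `rate_per` and every count are assumptions made for illustration, with k = 3 (per 1,000).

```python
def rate_per(events: float, population: float, k: int = 3) -> float:
    """Return events per 10**k population."""
    return events / population * 10 ** k


# Hypothetical data for one community in one year (illustrative only).
new_cases_in_year = 150           # cases that began during the year
midyear_population_at_risk = 50_000
cases_on_survey_day = 400         # all existing cases at time t
population_on_survey_day = 50_000
cases_during_year = 520           # anyone who had the disease during the year
midyear_population = 50_000

incidence = rate_per(new_cases_in_year, midyear_population_at_risk)
point_prevalence = rate_per(cases_on_survey_day, population_on_survey_day)
period_prevalence = rate_per(cases_during_year, midyear_population)

print(f"Incidence rate:    {incidence:.1f} per 1,000")
print(f"Point prevalence:  {point_prevalence:.1f} per 1,000")
print(f"Period prevalence: {period_prevalence:.1f} per 1,000")
```

Incidence counts only new cases, whereas both prevalence measures count new and existing cases together.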
Case Fatality Rate

• Case fatality rate is the proportion of people with a particular disease who die from it
within a specified period of time.
o For example, the proportion of malaria cases that died within the last 2 years.
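The manual gives no formula for the case fatality rate. A commonly used expression, assumed here and shown with hypothetical numbers (2,000 malaria cases, of whom 40 died in the period), is:

$$\text{Case Fatality Rate} = \frac{\text{No. of deaths from the disease in the period}}{\text{No. of cases of the disease in the period}} \times 100 = \frac{40}{2{,}000} \times 100 = 2\%$$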
Specific Rates
• Specific rates apply to:
o Defined geographic areas
o Defined age groups
o Different sexes (male, female)
o Defined socio-economic characteristics (e.g., education level, marital status, etc.)
• They are named after that specification (for example, an age-specific or sex-specific rate).
Measures of Mortality (Death)

• There are several key measures that are used to express mortality. The measures and
mathematical expressions are detailed below.
• Crude Death Rate (CDR)
$$\text{Crude Death Rate} = \frac{\text{No. of deaths in one year}}{\text{Total population}} \times 1{,}000$$
• Note: When the denominator is approximated by the ‘total population’, then the index
obtained is not the actual crude death rate, but rather a ‘crude mortality ratio.’
• Infant Mortality Rate (IMR)

$$\text{IMR} = \frac{\text{No. of deaths of infants under 1 year in time period}}{\text{No. of live births in time period}} \times 1{,}000$$
o Infant mortality rate is often broken down into several indices depending on the age
categories of an infant.
o Generally, these rates are expressed as the number of deaths per 1,000 live births.
Refer to Handout 8.1: Infant Mortality Rate (IMR) Measures

• Maternal Mortality Ratio (MMR)
$$\text{Maternal Mortality Ratio} = \frac{\text{No. of maternal deaths in a given year}}{\text{No. of live births + no. of stillbirths in time period}} \times 100{,}000$$
o Maternal death is defined as the death of a woman while pregnant or within the 42
days after termination of that pregnancy, regardless of the length and site of the
pregnancy, due to any cause related to or aggravated by the pregnancy itself or its care
but not due to accidental or incidental causes.
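The three mortality measures above can be computed side by side, as in the Python sketch below. All counts are hypothetical, and the maternal mortality ratio uses the denominator given in this manual (live births plus stillbirths).

```python
# Hypothetical counts for one region in one year (illustrative only).
deaths_all_ages = 9_000
total_population = 750_000
infant_deaths = 1_900   # deaths of infants under 1 year
live_births = 25_000
stillbirths = 500
maternal_deaths = 102

# Crude death rate per 1,000 population.
cdr = deaths_all_ages / total_population * 1_000

# Infant mortality rate per 1,000 live births.
imr = infant_deaths / live_births * 1_000

# Maternal mortality ratio per 100,000 births (live births + stillbirths).
mmr = maternal_deaths / (live_births + stillbirths) * 100_000

print(f"CDR = {cdr:.1f} per 1,000 population")
print(f"IMR = {imr:.1f} per 1,000 live births")
print(f"MMR = {mmr:.1f} per 100,000 births")
```

Each measure uses a different denominator, which is why the three figures cannot be compared directly with one another.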
Key Points
• The main sources of statistical data are census, vital registration and sample surveys.
• It is important to be able to distinguish between rates, ratio and proportion.
• Measures of fertility, morbidity and mortality are expressed in rates, ratio or proportion.
Evaluation
• Describe the gross reproductive rate.
• Define prevalence rate.
• Define incidence rate.
• What is the neonatal mortality rate?
Handout 8.1 Infant Mortality Rate (IMR) Measures
• Infant Mortality Rate (IMR)
$$\text{IMR} = \frac{\text{No. of deaths of infants under 1 year in time period}}{\text{No. of live births in time period}} \times 1{,}000$$
o Infant mortality rate is often broken down into several indices depending on the age
categories of an infant.
o Generally, these rates are expressed as the number of deaths per 1,000 live births.
• Neonatal Mortality Rate (NMR)
$$\text{NMR} = \frac{\text{No. of deaths of infants under 28 days in time period}}{\text{No. of live births in time period}} \times 1{,}000$$
• Early Neonatal Mortality Rate (ENMR)
$$\text{ENMR} = \frac{\text{No. of deaths of infants aged under 1 week in time period}}{\text{No. of live births in time period}} \times 1{,}000$$
• Late Neonatal Mortality Rate (LNMR)
$$\text{LNMR} = \frac{\text{No. of deaths of infants aged 1-4 weeks in time period}}{\text{No. of live births in time period}} \times 1{,}000$$
• Post Neonatal Mortality Rate (Post-NMR)
$$\text{Post-NMR} = \frac{\text{No. of deaths of infants aged 4-52 weeks in time period}}{\text{No. of live births in time period}} \times 1{,}000$$
• Stillbirth Rate
$$\text{Stillbirth Rate} = \frac{\text{No. of stillbirths in time period}}{\text{No. of live births + no. of stillbirths in time period}} \times 1{,}000$$
• Perinatal Mortality Rate
$$\text{Perinatal Mortality Rate} = \frac{\text{No. of stillbirths and neonatal deaths in time period}}{\text{No. of live births in that same year}} \times 1{,}000$$
o This index is important because it documents fetal and neonatal death during or very
soon after delivery. It includes neonates that are born dead or alive.
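The sketch below shows how the infant mortality indices in this handout fit together. The counts are hypothetical, and every rate uses live births as the denominator except the stillbirth rate, as in the formulas above.

```python
# Hypothetical counts for one year (illustrative only).
live_births = 20_000
stillbirths = 400
deaths_under_1_week = 360
deaths_1_to_4_weeks = 140
deaths_4_to_52_weeks = 700


def per_1000_live_births(deaths: int) -> float:
    """Return deaths per 1,000 live births."""
    return deaths / live_births * 1_000


enmr = per_1000_live_births(deaths_under_1_week)
lnmr = per_1000_live_births(deaths_1_to_4_weeks)
nmr = per_1000_live_births(deaths_under_1_week + deaths_1_to_4_weeks)
post_nmr = per_1000_live_births(deaths_4_to_52_weeks)
imr = per_1000_live_births(
    deaths_under_1_week + deaths_1_to_4_weeks + deaths_4_to_52_weeks
)

# The stillbirth rate uses all births (live births + stillbirths) as denominator.
stillbirth_rate = stillbirths / (live_births + stillbirths) * 1_000

# Perinatal mortality rate as defined in this handout.
neonatal_deaths = deaths_under_1_week + deaths_1_to_4_weeks
perinatal_rate = per_1000_live_births(stillbirths + neonatal_deaths)

print(f"ENMR = {enmr:.1f}, LNMR = {lnmr:.1f}, NMR = {nmr:.1f}")
print(f"Post-NMR = {post_nmr:.1f}, IMR = {imr:.1f}")
print(f"Stillbirth rate = {stillbirth_rate:.1f}, Perinatal = {perinatal_rate:.1f}")
```

Because the age categories do not overlap and share one denominator, the early and late neonatal rates add up to the NMR, and the NMR and Post-NMR add up to the IMR.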
Session 9: Introduction to Epidemiology
Learning Objectives
• Define the concepts of epidemiology, health and disease
• Identify key applications and achievements of epidemiology
• Describe two key epidemiological theories of disease causation
• Explain the determinants of health and disease
Definition of Concepts
Definition
• Epidemiology: The study of the distribution and determinants of health-related states or
events in specified populations, and the application of this study to the control of health
problems.
• Key elements of this definition can be further understood as follows:
o Study = basic science
o Distribution = time, place, person
o Determinants = cause, risk factors
o Event = health status
o Population = public health
o Application = information for action
• Three closely-related components (distribution, determinants and frequency) encompass
all epidemiological principles and methods.
• Epidemiology is a multidisciplinary subject which has borrowed from the fields of
demography, statistics, sociology and other sciences to become a distinct discipline with
its own philosophy.
Distribution: Descriptive Epidemiology

• What, who, when, and where
• Frequency: number, rates, and risk
• Quantify diseases to determine magnitude
• Patterns: time, place, and person
Determinants: Analytic Epidemiology

• Why and how
• Causes and influences
• Evidence for control and prevention
• Compare between exposure groups to determine causal relationships
Health-related Events
• Epidemic communicable diseases
• Endemic communicable diseases
• Non-communicable Diseases
• Chronic Diseases
• Injuries
• Maternal and Child Health
• Occupational, and Environmental Health
• Health Behaviors
Definition of Health
• Health: A state of complete physical, mental, and social well-being and not merely the
absence of disease or infirmity. (World Health Organization)
• Health is more than just the absence of pain or discomfort. Good health is a dynamic
relationship between the individual, friends, family and the environment within which we
live and work.
Definition of Disease
• Disease: A disorder of structure or function in a human, especially one that produces
specific symptoms or that affects a specific part.
Definition of Reservoir
• Reservoir: The habitat in which disease-causing organisms normally live and multiply.
• Reservoirs can be human, animal, or environmental.
o Diseases with human reservoirs:
Smallpox (symptomatic)
HIV (asymptomatic)
o Diseases with animal reservoirs (also known as zoonoses):
Brucellosis (can be found in goats, sheep, cattle, pigs)
Plague (can be found in rats and other wild rodents)
Anthrax (can be found in cattle, sheep, goats, and other herbivores)
o Environmental
Histoplasmosis (caused by a fungus that is often found in areas with lots of
bird/bat droppings such as caves)
Legionnaires’ disease (caused by aquatic bacteria that grow in warm water)
• Note: a reservoir is different from a vector or disease carrier, which are agents of disease
transmission.
Applications and Achievements of Epidemiology

• Epidemiology is used to:
o Describe the etiological factors in causation of disease.
Poor health results from a complex interplay of environmental factors (including
lifestyle/health behavior and external environment) and genetic factors in a
healthy individual.
o Study the natural history of disease, from good health to subclinical changes until
occurrence of clinical disease, where the outcome can be recovery (with or without
disability) or death.
o Describe the health status of populations, and the distribution of diseases in a
population (in which the proportion of people with poor health changes with time and
age).
o Provide and analyze information for planning, implementation and evaluation of
health services; such as interventions directed at maintaining good health (including
health promotion, prevention, and medical management of ill people through the
provision of prompt treatment).
Achievements in Epidemiology
• The field of epidemiology has contributed to many advancements in disease prevention
and eradication. Two examples are described below.
Smallpox Eradication
• During the late 1970s, epidemiology played a central role in smallpox eradication.
• The World Health Organisation (WHO) coordinated intensive smallpox eradication
campaigns, informed by epidemiological data about the distribution of cases, and the
model, mechanisms, and levels of disease transmission.
• Data was accumulated by mapping outbreaks of the disease, and by evaluating control
measures.
• About ten years after the intensified campaign to eradicate smallpox began, only
two countries were still reporting smallpox cases. The last naturally occurring case of
smallpox was reported in the year 1977.
• Smallpox was certified as eradicated in 1979, and the World Health Assembly endorsed
this in 1980.
Methylmercury Poisoning due to Environmental Pollution

• Epidemiology played a crucial role in identifying the cause of what was one of the first
reported epidemics of disease caused by environmental pollution, in 1956 in Japan.
• The first cases were thought to be due to infectious meningitis, but it was observed that
many patients with similar symptoms resided in (or had recently spent time in) villages
along Minamata Bay, in which the main livelihood/occupation was fishing.
• A survey of affected and unaffected people showed that the victims were almost
exclusively members of families whose main occupation was fishing. People who visited
these families, and family members who ate only small amounts of fish, did not suffer
from the disease.
• It was concluded that something from the fish had poisoned the patients and that the
disease was not communicable.
• Eventually, researchers determined that industrial wastewater from a chemical factory
had released methylmercury into Minamata Bay. The mercury accumulated in fish and
shellfish, which were later eaten by humans and animals.
• This was the first known outbreak of methylmercury poisoning involving fish, and it
took several years of research until the real cause (methylmercury) was identified.
Epidemiological Theories of Disease Causation

Historical Roots
• Epidemiological concepts are rooted in the works of Hippocrates.
o Hippocrates wrote ‘Treatise on Air, Water, and Places.’
o Hippocrates suggested that it was important to consider a variety of environmental
influences on disease in humans, such as heat, cold, the winds, water quality
factors, individual habits of eating and drinking to excess, indolence/laziness, or
fondness for exercise and physical labour.
• These ideas provide a foundation for our modern understanding of epidemiology, and
how environment and behavior/lifestyle impact health. In addition, they emphasize the
importance of analyzing a variety of situations to investigate causal relationships in
disease occurrence, with the ultimate aim of disease prevention.
• Historically, two theories emerged from attempts to investigate or control epidemic
disease: the miasma theory and the contagium vivum theory.
The Miasma Theory
• This theory, dating back to the early 1700s, offered an alternative explanation for the
origin of epidemics.
• The idea was based on the notion that when air is of bad odor or quality, persons
breathing that air would become ill (e.g., malaria, cholera, etc.).
• The miasma theory of disease did inspire many sanitation reforms in England in the
1800s; however, the theory was not supported by any scientific explanation and was
abandoned.
The Contagium Vivum Theory

• According to this theory, a living contagion (or ‘contagious living fluid’) was thought to
be involved in disease. This theory necessarily depended on two other concepts:
o The specificity of both a disease and its cause
o The existence of disease-causing micro-organisms
• This theory developed in the late 1800s from the groundbreaking work of Robert
Koch, Louis Pasteur, and Martinus Willem Beijerinck.
• This theory is also known as the ‘germ theory.’
• The contagium vivum theory is the one that is accepted today, and it helped to justify
many environmental sanitation interventions and shape many epidemiologic studies.
Determinants/Factors Related to the Occurrence of Disease

• Determinant: A factor which determines the nature or outcome of something.
• The epidemiological patterns of infectious diseases depend upon factors that influence
probability of contact between an infectious agent and a susceptible person/host.
• This probability of contact is determined by an intricate interplay between factors related
to the triad of agent, host and environment.
• It is important to approach disease causation in this manner as it gives us an insight
necessary for developing rational preventive measures.
Examples of Determinants/Factors and the Associated Disease

• The following charts show some illustrative examples of determinants of disease.
Refer to Handout 9.1: Examples of Determinants/Factors Influencing Disease Occurrence.
Figure 1. Factors related to the Agent

I. Nutritive Elements (excess or deficiency)

| Factor | Disease |
|---|---|
| Cholesterol | Atherosclerosis |
| Calories | Marasmus |
| Protein | Kwashiorkor |
| Iodine | Goitre |
| Iron and folic acid | Anaemia |

II. Chemical Agents (presence of poisons or allergens)

| Factor | Disease |
|---|---|
| Pesticides | Poisoning |
| Drugs, alcohol | Intoxication/dependency |
| Allergens | Eczema |
| Pollen | Hay fever |

III. Physical Agents

| Factor | Disease |
|---|---|
| Energy, speed | Accidents |
| Solar radiation | Sunburn |
| Radioactivity | Neoplasm |

IV. Infectious Agents

| Organism | Factor (e.g.) | Disease |
|---|---|---|
| Bacteria | Mycobacterium tuberculosis | Tuberculosis |
| | Vibrio cholerae | Cholera |
| Viruses | Measles virus | Measles |
| | Polio virus | Poliomyelitis |
| | Smallpox virus | Smallpox |
| Rickettsia | Rickettsia prowazekii | Typhus |
| | Rickettsia conorii | Tick bite fever |
| Protozoa | Plasmodium malariae | Malaria |
| | Entamoeba histolytica | Amoebiasis |
| | Trypanosoma | Trypanosomiasis |
| Helminths | Schistosoma haematobium | Urinary schistosomiasis |
| | Ascaris lumbricoides | Ascariasis |
| Fungi | Trichophyton spp. | Tinea corporis |
Figure 2. Factors related to the Host

| Factor | Example(s) |
|---|---|
| Age | Paralytic polio: the ratio of paralytic cases to infections increases with age (1:1000 among young children and 1:75 among adults). |
| Sex | Males usually have a higher poliomyelitis attack rate than females. |
| Genetic | Persons with sickle cell trait have a decreased risk of malaria due to Plasmodium falciparum. Blood group A has an increased risk of gastric cancer, while group O has an increased risk of duodenal ulcers. |
| Ethnicity | Certain ethnic groups have an increased risk of keloid and gastrointestinal tract cancer. |
| Physiology | Pregnancy: candidiasis. Puberty: goiter, stress, nutritional state and fatigue. |
| Immunology | Hypersensitivity, allergy. Active: prior infection, immunization. Passive: maternal antibodies. |
| Existing Pathology | Pre-existing disease may initiate another by interfering with immunity (e.g. malaria and herpes simplex, or other concurrent disease). |
| Behavior | Personal hygiene, religion, customs, habits, utilization of health resources and related diseases. |
Figure 3. Factors Related to the Environment

| Factor | Sub-factor | Example |
|---|---|---|
| Physical | | Climate, geology, radiation, heat, light, air pollution associated with chronic respiratory disease. |
| Biological | Human | Population density |
| | Flora | Food source, influence on disease agents/vectors |
| | Fauna | Influences presence of host/vectors and agents |
| Socio-Economic | Occupation | Occupational hazards |
| | Urbanization | Crowding, stress |
| | Development | Education, poverty, availability of health services |
| | Disruption | War and conflict (Rwanda 1994/5), natural disasters (Haiti earthquake, 2010) |
The Interaction of the Agent, Host and Environment in Disease Causation

• If the characteristics of the agent, host and environment are studied in isolation, we will
never reach a full understanding of why diseases are distributed the way they are.
• The interaction between agent, host and environment is the object of the study of ecology.
The study of this interaction will give clues to answers of questions like:
o How and why do new pathogens and/or new diseases emerge?
o Why does the spectrum of disease change over time?
o Is it possible to control and finally eradicate infectious disease?
Key Points
• Three closely interrelated components (distribution, determinants and frequency)
encompass all epidemiological principles and methods.
• Epidemiology is a multidisciplinary subject which has borrowed from demography,
statistics, sociology and other sciences to become a distinct discipline with its own
philosophy.
• Miasma and contagium theory are the main historical epidemiological theories of disease
causation.
• Disease occurrence results from an interplay of determinants/factors related to the host,
agent and environment.
Evaluation
• Define the following epidemiological concepts:
o Health
o Disease
• Name 2 major achievements of epidemiology.
• What is a determinant of a disease?
• Provide an example of determinants/factors influencing disease that is related to:
o Agent
o Host
o Environment
References
• Bonita, R., Beaglehole, R., Kjellstrom, T. (2006). Basic Epidemiology. (2nd Edition).
Geneva, Switzerland: WHO
• Kapiga, S. et al. (1998). Lecture notes in epidemiology and research methodology.
Department of Epidemiology and Biostatistics. MUCHS.
• Rosner, B. (2006). Fundamentals of Biostatistics. (6th edition). Belmont, CA: Thomson
Brooks/Cole.
• WHO. (2003). WHO Definition of health. Retrieved from
http://www.who.int/about/definition/en/print.html/ (date unknown)
Handout 9.1: Examples of Determinants/Factors Influencing Disease Occurrence
Figure 1. Factors related to the Agent

I. Nutritive Elements (excess or deficiency)

| Factor | Disease |
|---|---|
| Cholesterol | Atherosclerosis |
| Calories | Marasmus |
| Protein | Kwashiorkor |
| Iodine | Goitre |
| Iron and folic acid | Anaemia |

II. Chemical Agents (presence of poisons or allergens)

| Factor | Disease |
|---|---|
| Pesticides | Poisoning |
| Drugs, alcohol | Intoxication/dependency |
| Allergens | Eczema |
| Pollen | Hay fever |

III. Physical Agents

| Factor | Disease |
|---|---|
| Energy, speed | Accidents |
| Solar radiation | Sunburn |
| Radioactivity | Neoplasm |

IV. Infectious Agents

| Organism | Factor (e.g.) | Disease |
|---|---|---|
| Bacteria | Mycobacterium tuberculosis | Tuberculosis |
| | Vibrio cholerae | Cholera |
| Viruses | Measles virus | Measles |
| | Polio virus | Poliomyelitis |
| | Smallpox virus | Smallpox |
| Rickettsia | Rickettsia prowazekii | Typhus |
| | Rickettsia conorii | Tick bite fever |
| Protozoa | Plasmodium malariae | Malaria |
| | Entamoeba histolytica | Amoebiasis |
| | Trypanosoma | Trypanosomiasis |
| Helminths | Schistosoma haematobium | Urinary schistosomiasis |
| | Ascaris lumbricoides | Ascariasis |
| Fungi | Trichophyton spp. | Tinea corporis |
Figure 2. Factors related to the Host

| Factor | Example(s) |
|---|---|
| Age | Paralytic polio: the ratio of paralytic cases to infections increases with age (1:1000 among young children and 1:75 among adults). |
| Sex | Males usually have a higher poliomyelitis attack rate than females. |
| Genetic | Persons with sickle cell trait have a decreased risk of malaria due to Plasmodium falciparum. Blood group A has an increased risk of gastric cancer, while group O has an increased risk of duodenal ulcers. |
| Ethnicity | Certain ethnic groups have an increased risk of keloid and gastrointestinal tract cancer. |
| Physiology | Pregnancy: candidiasis. Puberty: goiter, stress, nutritional state and fatigue. |
| Immunology | Hypersensitivity, allergy. Active: prior infection, immunization. Passive: maternal antibodies. |
| Existing Pathology | Pre-existing disease may initiate another by interfering with immunity (e.g. malaria and herpes simplex, or other concurrent disease). |
| Behavior | Personal hygiene, religion, customs, habits, utilization of health resources and related diseases. |
Figure 3. Factors Related to the Environment

| Factor | Sub-factor | Example(s) |
|---|---|---|
| Physical | | Climate, geology, radiation, heat, light, air pollution associated with chronic respiratory disease. |
| Biological | Human | Population density |
| | Flora | Food source, influence on disease agents/vectors |
| | Fauna | Influences presence of host/vectors and agents |
| Socio-Economic | Occupation | Occupational hazards |
| | Urbanization | Crowding, stress |
| | Development | Education, poverty, availability of health services |
| | Disruption | War and conflict (Rwanda 1994/5), natural disasters (Haiti earthquake, 2010) |
Session 10: Ecology and Epidemiological Approach to Causation
Learning Objectives
• Define basic concepts of ecology
• Describe the epidemiological models of disease causation
• Explain the concept of causation in epidemiology
• Identify the guidelines for causation of disease
Basic Concepts of Ecology

Definitions
• Ecology: The study of the relationship between organisms and their environments.
• Climax State
o This is the final constant condition of flora and fauna in a specified geographic area.
o The equilibrium reached varies, depending on climate and soil.
o The climax state is not stable, but is a dynamic condition.
o The main factor responsible for precluding stability is human activity.
o Human activity can change an area much more rapidly, profoundly and unpredictably
than changes in weather, climate or natural events.
For example: Irrigation projects, behavioral changes, wars, urbanization, etc.
• Food Chain
o This is a series of organisms each dependent on the next as a source of food.
o From the smallest plants or animals up to humans and large predators, plants and
animals serve as food for other species higher up in the food chain.
o Predators (lion, cheetah, and hyena) and birds of prey (hawks, eagles) are at the end of
the food chain.
o Throughout the food chain, toxins and chemicals can accumulate, usually originating
from environmental pollution. These toxins and chemicals can become concentrated
at dangerously high levels as they move up the food chain.
• Habitat
o This is the typical environment in which a certain species usually lives.
o The habitat of different species is very much interdependent. Some species require
very strict conditions, and are restricted to limited geographic areas, whereas others
can adapt easily to different habitats.
o Humans are adapted to an unusually wide habitat, partially as a result of technology.
o The habitat of domestic animals naturally coincides closely with the human habitat.
Psittacosis and toxoplasmosis are typical zoonoses originating from pets. This may be
a concern for people who live in close proximity to pets.
• Population Density
o Population means all the inhabitants of a particular place, and density refers to the
quantity of people or things in a given area or space.
o Fluctuations in plants and animal populations affect other species higher in the food
chain.
o This may become relevant to human diseases, especially zoonoses (plague, rabies).
o The density of human populations is an important ecological variable, as it affects
food availability, and has an impact on the physical environment.
For example, increased numbers of domestic animals can lead to overgrazing,
erosion, and desertification.
When humans enter new environments, they may encounter new habitats for
disease vectors, parasites, etc. (For example, onchocerciasis transmitted by
blackflies, trypanosomiasis transmitted by tsetse flies, yellow fever transmitted by
mosquitoes, etc.)
o High human population densities (urbanization) together with human patterns of
socialization provide good opportunities to diseases of close contact (droplet
transmission, sexual transmission, etc.).
o It has been found that the probability of contracting certain diseases (e.g. measles) does
not depend on the proportion of susceptible individuals in a community, but rather on
the absolute number of susceptible individuals.
• Socialization
o Socialization refers to the way in which humans come in contact with each other.
Within each community or culture a great number of patterns exists in circumstances
such as work, school, religion and recreation.
• Adaptation
o A change or the process of change by which an organism or species becomes better
suited to its environment and better able to survive and reproduce.
o Two organisms can coexist in a host-parasite relationship; over time they often
adapt to each other as the parasite becomes less virulent and the host develops
resistance.
o Due to selection pressure, more sensitive hosts will be weeded out or relatively
resistant mutants will develop.
o This is in the interest of the parasite as well, because the worst that might happen to
the parasite is extinction of its host.
• Herd Immunity
o Herd immunity is the ability of a community to resist disease. It can occur naturally
by exposure to infection or artificially by vaccination.
o High herd immunity indicates the decreased probability of a group or community to
develop a disease upon introduction of an infectious agent (although there may be a
certain number of persons who are individually susceptible to the disease agent).
o This decreased probability (resistance) is a product of the number of susceptible persons, and
the probability that those who are susceptible will come into contact with an infected
person/disease-causing agent.
o The percentage of vaccination/natural immunity required to produce herd immunity in
a population depends on the disease concerned, and on socialization patterns.
For example, measles transmission still occurs in Tanzania and other tropical
developing countries.
In these places the measles virus may be present in the community.
The number of cases depends upon the number of susceptible persons
(children) exposed to the virus.
If there is a cyclic pattern with peaks occurring every 2-3 years, this
suggests that every 2-3 years there is a sufficient number of newly susceptible
children, and hence low herd immunity.
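The cyclic pattern described above can be illustrated with a minimal sketch in Python. All figures (annual births, vaccination coverage, epidemic threshold) and the function name `epidemic_years` are hypothetical assumptions chosen for illustration, not Tanzanian data. The sketch also assumes that every unvaccinated newborn is susceptible and that an epidemic leaves the survivors immune.

```python
def epidemic_years(births_per_year, coverage, threshold, years):
    """Return the years in which the pool of susceptibles reaches the threshold.

    Hypothetical model: unvaccinated newborns join the susceptible pool each
    year; once the absolute number of susceptibles reaches the threshold, an
    epidemic occurs and the pool is emptied (survivors become immune).
    """
    susceptibles = 0
    outbreaks = []
    for year in range(1, years + 1):
        susceptibles += round(births_per_year * (1 - coverage))
        if susceptibles >= threshold:
            outbreaks.append(year)
            susceptibles = 0  # herd immunity is high again after the epidemic
    return outbreaks


# 60% coverage: 400 new susceptibles a year, so peaks recur every 3 years.
print(epidemic_years(births_per_year=1000, coverage=0.60, threshold=1000, years=9))
# 90% coverage: only 100 new susceptibles a year, no peak within 9 years.
print(epidemic_years(births_per_year=1000, coverage=0.90, threshold=1000, years=9))
```

Under these assumptions, raising vaccination coverage slows the build-up of susceptibles and lengthens the interval between peaks.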
Epidemiological Models of Disease Causation

• Disease does not occur due to a single factor, but due to a number of interrelated factors.
• These factors can be divided into two categories:
o Essential/Necessary Factors
o Contributory Factors
• Essential/necessary factors are ‘required ingredients’ for disease to occur.
o These are agents of disease such as bacteria and viruses in infectious diseases, and
fire, nutrition, radiation or various poisons in non-infectious diseases.
• Contributory factors are host and environmental factors that are associated with increased
likelihood of disease occurrence.
o For example:
Host Factors: immunity, sex, age, etc.
Socio-economic factors: poverty, development, etc.
Physical/Environmental Factors: rainfall, temperature, etc.
Biological Factors: presence of vectors, animal reservoirs, etc.
• There are different models that are used to represent more complex systems in the
interplay of these factors and human ecology for disease causation. These models are:
o Epidemiologic Triangle
o Wheel Model of Disease Causation
o Web of Causation Model
Triangular Model
• In this model, also called the epidemiological triad, three main components are important
in the chain of disease transmission:
o Agent
o Host
o Environment
Figure 1: The Epidemiologic Triangle
• Under stable ecological conditions, the epidemiological triad is in a balanced state, and
the disease can be said to be absent or endemic.
• If this balance is disturbed and becomes unfavorable to the agent, the incidence of the
disease will decrease.
o If the situation remains unfavorable to the agent, the disease may become sporadic or
even disappear.
• If the balance alters to favour the agent, then an epidemic may occur.
• Although there may be a great deal of information available about a particular disease
agent, the roles of host and environment are not completely understood.
o For example, we do not know why only some of the people exposed to large doses of
x-rays develop leukemia or why not all heavy smokers develop lung cancer.
o Some diseases like schizophrenia, coronary heart disease, rheumatoid arthritis, and
essential hypertension have not been linked to any causative agent.
• Given these limitations, new models have been developed which de-emphasize the role of
the agent and stress the multiplicity of interactions between host and environment.
• This is described in the wheel model and web of causation model.
The Wheel Model

• The hub of the wheel is the host, which has genetic makeup as its core.
• Surrounding the host is the environment, systematically divided into three sectors:
biological, physical, and social environment.
• The relative sizes of the different components of the wheel depend upon the specific
disease/problem under consideration.
o For hereditary diseases, the genetic core would be relatively large. For conditions like
measles the genetic core would be of lesser importance.
• The state of immunity of the host and the biological sector of the environment would
contribute more heavily in this model.
Figure 2: The Wheel Model of Disease Causation
Source: Kapiga S.H., et al., 1998.
The Web of Causation Model

• In this model many factors are associated with occurrence of disease.
• Effects do not depend on a single isolated cause, but rather develop as a result of
interrelated causation in which each link is a result of a complex genealogy of
antecedents.
• The large number of antecedents creates a condition that may approximately be
conceptualized as a web.
• For example:
o Cardiovascular disease:
Heredity, health behaviors (smoking, hormones, stress, diet (fat, total calories and
salt), physical activity, etc.), and health status (hyperlipidaemia, obesity, etc.)
interact to bring about atherosclerosis, hypertension and coagulation/clot lysis
which in turn lead to coronary heart disease, cerebro-vascular disease and
hypertensive disease.
o Jaundice (serum hepatitis) development after Syphilis Treatment:
Before the advent of antibiotics the treatment of syphilis was by intravenous
injection of arsenical compounds. Many of the syphilis patients who received this
treatment developed jaundice. In 1967 viral antigens (necessary factor) became
associated with hepatitis for the first time.
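As an illustration only, the cardiovascular example above can be written down as a small directed graph and searched for all the antecedents of one outcome. This is a sketch under stated assumptions: the names `WEB` and `antecedents` are hypothetical, and linking every upstream factor to every intermediate condition is a simplification of the interactions described in the text.

```python
# Hypothetical encoding of the cardiovascular example: each key contributes
# to every item in its list (a simplifying assumption).
INTERMEDIATE = ["atherosclerosis", "hypertension", "coagulation/clot lysis"]
OUTCOMES = ["coronary heart disease", "cerebro-vascular disease",
            "hypertensive disease"]

WEB = {
    "heredity": INTERMEDIATE,
    "health behaviours": INTERMEDIATE,
    "health status": INTERMEDIATE,
    "atherosclerosis": OUTCOMES,
    "hypertension": OUTCOMES,
    "coagulation/clot lysis": OUTCOMES,
}


def antecedents(web, outcome):
    """Collect every factor from which `outcome` can be reached in the web."""
    found = set()
    to_visit = [outcome]
    while to_visit:
        node = to_visit.pop()
        for factor, effects in web.items():
            if node in effects and factor not in found:
                found.add(factor)
                to_visit.append(factor)
    return found


for factor in sorted(antecedents(WEB, "coronary heart disease")):
    print(factor)
```

Even in this simplified web, a single outcome has six antecedents at two levels, which is why no single isolated cause can be named.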
Activity: Group Discussion
Instructions
Think about the models of disease causation that you have just discussed. Brainstorm and
discuss different factors that result in or contribute to disease or disability.
Consider the following four categories: Host Factors, Social/Environmental Factors,
Physical/Environmental Factors, and Biological Environment Factors. Answer the questions below.
Questions
• What factors contribute to road accidents? What category do these factors belong in?
• What factors contribute to coronary heart disease and essential hypertension? What
category do these factors belong in?
Concept of Causation in Epidemiology

• An understanding of the causes of disease is important in the health field not only for
prevention but also in diagnosis and the application of the correct treatment.
• A cause of a disease is an event, condition, characteristic or a combination of these
factors which plays an important role in producing the disease.
• A cause is termed ‘sufficient’ when it inevitably produces or initiates a disease in an
individual, and is termed ‘necessary’ if a disease cannot develop in its absence.
Factors in Causation of Disease

• There are four types of factors that play a part in the causation of a disease:
o Predisposing Factors
o Enabling Factors
o Precipitating Factors
o Reinforcing Factors
• All may be necessary, but they are rarely sufficient to cause a particular disease or
condition.
• Predisposing factors such as sex, age, and previous illness may create a state of
susceptibility to a disease agent.
• Enabling factors such as low income, poor nutrition, bad housing and inadequate
medical care may favour the development of a disease. Also, circumstances that assist in
recovery from illness or in the maintenance of good health could be called enabling
factors.
• Precipitating factors such as exposure to specific disease agent or noxious agent may be
associated with the onset of a disease or condition.
• Reinforcing factors such as repeated exposure to an agent and unduly hard work may
aggravate an established disease or condition.
• The term ‘risk factor’ is commonly used to describe factors that are positively associated
with the development of a disease but that are not sufficient to cause the disease.
o Some risk factors (e.g., tobacco smoking) are associated with several diseases and
some diseases (e.g., coronary heart disease) are associated with several risk factors.
o Epidemiological studies can measure the relative contribution of each factor to
disease occurrence, and also the corresponding potential reduction in disease from the
elimination of each risk factor.
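One common way to express the potential reduction in disease from eliminating a risk factor is the population attributable fraction. The sketch below uses Levin's formula, which is not given in this manual and is introduced here as an assumption; the function name and the example figures are hypothetical.

```python
def attributable_fraction(exposure_prevalence, relative_risk):
    """Population attributable fraction by Levin's formula.

    exposure_prevalence: proportion of the population exposed (0 to 1).
    relative_risk: risk in the exposed divided by risk in the unexposed.
    """
    excess = exposure_prevalence * (relative_risk - 1)
    return excess / (1 + excess)


# Hypothetical figures: 30% of people are exposed and the relative risk is 3.
print(f"{attributable_fraction(0.3, 3.0):.1%}")  # 37.5%
```

In this hypothetical population, removing the exposure could prevent a little over a third of cases, provided the association is causal.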
Guidelines for Establishing the Cause of Disease

• In order to establish a cause of a disease, an assessment of factors related to the
occurrence of a disease needs to be done.
• Guidelines should be used to make a judgment about whether an observed association
should be deemed ‘causal’.
• ‘Causal’ means something of, relating to, or acting as a cause.
Refer to Handout 10.1: Assessing the Relationship between a Possible Cause and
an Outcome
• The following are steps in assessing the nature of the relationship between a possible
cause and an outcome:
Figure 3: Summary of Guidelines for Determining Causation

Guideline                    Concept
Temporal Relation            Does the cause precede the effect? (essential)
Plausibility                 Is the association consistent with other knowledge? (mechanism of action, evidence from experimental animals)
Consistency                  Have similar results been shown in other studies?
Strength                     What is the strength of the association between the cause and effect? (relative risk)
Dose-Response Relationship   Is increased exposure to the possible cause associated with increased effect?
Reversibility                Does removal of a possible cause lead to a reduction of disease risk?
Study Design                 Is the evidence based on a strong study design?
Judging the Evidence         How many lines of evidence lead to the conclusion?
• Temporal Relationship
o Exposure to the factor must necessarily precede development of the disease in order
to consider a causal association.
o This is usually self-evident, although difficulties may arise in case control and cross-
sectional studies when measurements of the possible cause and effect are made at the
same time and the effect may in fact alter the exposure.
o In cases where the cause is an exposure that can be encountered at different levels, it
is essential that a high enough level be reached before the disease occurs for the
correct temporal relationship to exist.
o Repeated measurement of the exposure at more than one point in time and in different
locations may strengthen the evidence.
• Consistency
o Consistency is demonstrated by several studies giving the same result.
o If the same association has been repeatedly observed by different researchers in
different places, under different circumstances and at different times, it is very likely
that the association is causal.
• Plausibility
o An association is biologically plausible when it agrees with current knowledge of how the disease develops.
o If an association is consistent with other knowledge, it is more likely to be causal.
For instance, laboratory experiments may have shown how exposure to the
particular factor could lead to changes associated with the effect measured.
• Strength
o A strong association between possible cause and effect, as measured by the size of the
risk ratio (relative risk), is more likely to be causal than a weak association.
o The measured strength can, however, be influenced by confounding or bias.
o The stronger the association (the higher the relative risk), the more readily we can accept direct
causation as a likely explanation of the observed association.
• Dose-Response Relationship
o If a dose-response relationship can be demonstrated, then the likelihood that the
exposure is causal increases.
o A dose-response relationship occurs when changes in the level of the possible cause/agent are
associated with changes in the prevalence or incidence of the effect. (This is also
termed as ‘biological gradient’).
• Reversibility
o When the removal of a possible cause results in a reduced disease risk, the likelihood
of the association being causal is strengthened.
For example, the cessation of cigarette smoking is associated with reduction in
the risk of lung cancer relative to the risk in people who continue to smoke.
This finding strengthens the likelihood that cigarette smoking causes lung cancer.
o If the cause leads to rapid, irreversible changes that subsequently produce disease regardless
of continued exposure (as with HIV infection), then reversibility cannot be a
condition for causality.
• Study Design
o The ability of study design to prove causation is the most important consideration.
o The best results come from well-designed, competently conducted, randomized
controlled trials; however, it is not always practical to use this study method for all
investigations, and data of this quality may not always be available.
• Judging the Evidence
o A particular exposure should produce one specific disease; otherwise there is a weak
argument in favour of causation.
For example, if an association is limited to specific workers, particular sites and
one type of disease, and no association is found between the type of
occupation and another disease, it is a strong argument in favour of causation.
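The strength and dose-response guidelines above can be illustrated numerically. The following is a minimal sketch in Python: the function names and the cohort figures are hypothetical, and relative risk is computed as the risk in the exposed group divided by the risk in the unexposed group.

```python
def risk(cases, total):
    """Proportion of a group that develops the disease."""
    return cases / total


def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk ratio: risk in the exposed divided by risk in the unexposed."""
    return risk(exposed_cases, exposed_total) / risk(unexposed_cases, unexposed_total)


def has_dose_response(risks_by_level):
    """True when risk rises at every step from the lowest to the highest exposure."""
    return all(low < high for low, high in zip(risks_by_level, risks_by_level[1:]))


# Hypothetical cohort: 30 of 1000 exposed and 10 of 1000 unexposed fall ill.
print(round(relative_risk(30, 1000, 10, 1000), 2))  # 3.0

# Hypothetical risks at low, medium and high exposure levels.
print(has_dose_response([risk(10, 1000), risk(20, 1000), risk(45, 1000)]))  # True
```

A relative risk well above 1 together with a rising gradient supports a causal interpretation, but neither rules out confounding or bias.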
Key Points
• The basic concepts of ecology are: ecology, climax state, food chain, habitat, population
density, socialization and adaptation.
• The concept of causation in epidemiology is an important element in determining disease
occurrence and trends.
• The common important epidemiological models of disease causation are the
epidemiological triangle, the wheel model, and the web of causation model.
• The guidelines for establishing the cause of a disease are: biological plausibility, temporal
relationship, consistency, strength, dose-response relationship, reversibility, study design
and judging the evidence.
Evaluation
• Define the following concepts of ecology: climax state, food chain, habitat, population
density, socialization and adaptation.
• Explain the epidemiological models of disease causation.
• What are the guidelines in establishing the cause of a disease?
References
• Kapiga S.H. et al. (1998). Lecture notes in epidemiology and research methodology.
Handout 10.1: Assessing the Relationship Between a Possible
Cause and an Outcome
Source: Bonita R., et al. (2006).
Session 11: Natural History and Levels of
Prevention of Diseases
Learning Objectives
• Define the concept of natural history
• Describe the stages of pathogenesis in the natural history of disease
• Recognize factors responsible for pre-pathogenesis, pathogenesis and post-pathogenesis.
• Identify the components of disease transmission
• Describe the roles of agent, source and host in disease transmission
Introduction to Natural History of Disease

• Natural History of Disease: Refers to the course of a disease over time, when unaffected
by any human intervention like prevention, treatment or rehabilitation.
• Pathogenesis: The step-by-step origination and development of a disease.
Stages of Pathogenesis in the Natural History of Disease

Pre-Pathogenesis
• This is a time before the onset of a disease in the human population; however, the factors
responsible or that favour its occurrence already exist in that population. (i.e., the
groundwork has been laid.)
• In this period, there is interaction between host, agent and environmental factors.
• This interaction or interplay ultimately produces an opportunity for the start of a disease,
or for infection (in the case of an infectious disease). (Also known as disease stimulus.)
• As epidemiologists we study the distribution of disease in human populations rather than
in an individual. This distinguishes epidemiologists from clinicians, who examine disease
in individuals. Epidemiologists, then, must study the natural history of disease in a
population.
• Factors Responsible for Pre-Pathogenesis

o Host factors include:
Age
Sex
Social class
Personality
Genetic factors
Education
Marital status
o Note: a host is an organism capable of being infected by an agent.
o Agent factors are etiological factors which are necessary for bringing about a
particular disease in a susceptible host. Examples of agents are:
Plasmodium
Yersinia pestis
Mycobacterium tuberculosis
o A factor whose presence is associated with an increased probability for the disease to
develop is called a risk factor.
Risk factors for the host may be fixed (e.g., age, sex, race) or modifiable (e.g.,
smoking habits, alcoholism and serum cholesterol).
o Agent factors are related to the agent's survival in the external environment. These include:
Infectivity
Pathogenicity
Virulence
Antigenicity
o Environmental factors can create favourable climates for the development of disease
agents or risk factors. Examples include:
Climate
Altitude
Temperature
Presence and density of population vectors
Environmental sanitation
Culture
Economic conditions/poverty
Health services (quality, availability, etc.)
Social services
Pathogenesis
• This is the period from the onset of the disease stimulus (the interaction between host, agent
and environmental factors) up to the development of discernible lesions, followed by recovery
or by progression of the disease process to disability or death.
• This disease process period can be terminated or shortened by human interventions in
terms of treatment (or secondary prevention).
Post-Pathogenesis
• This is the stage where the agent (or necessary factor) has already been removed from the
affected populations (patients) but there is the effect of the disease persisting in the form
of disability (or sequelae of the disease).
• A new disease may arise in the post-pathogenesis stage of another disease, e.g. rheumatic fever
follows streptococcal sore throat.
Levels of Prevention
Prevention
• Prevention: Inhibiting the development of a disease before it occurs.
o This is a relatively narrow conceptualization of prevention.
o In epidemiology, ‘prevention’ includes measures that interrupt or slow the
progression of a disease.
• Four levels of prevention have been identified in epidemiology:
o Primordial Prevention
o Primary Prevention
o Secondary Prevention
o Tertiary Prevention
Primordial Prevention
• Primordial prevention: Preventing agents or risk factors; preventing the interaction
between host, agent and environmental factors, so that disease may not occur.
o Mainly deals with underlying conditions, and reflects more on non-communicable
diseases. (In contrast, primary prevention deals with a specific agent or causal factor.)
• Efforts and interventions at the primordial prevention level involve anticipation of disease
occurrence and modification of the conditions responsible for the occurrence, before
disease happens.
o Whenever possible, these efforts should be evidence-based, drawing from solid
research and experience in other areas/countries.
o Examples of primordial prevention interventions include:
Policy and public health interventions that discourage, limit, and/or
prohibit cigarette smoking. Cigarette smoking can lead to high blood pressure,
strokes, or lung cancer.
Environmental interventions that reduce air pollution, the greenhouse
effect, acid rain, and ozone layer depletion can also result in a reduction of the
prevalence and severity of respiratory problems in the general population.
Primary Prevention
• Primary prevention: preventing healthy people from becoming ill.
• Typically considered the most cost-effective form of healthcare, because these efforts
help offset the suffering, cost and burden associated with disease.
• Primary prevention helps to lower disease incidence and control disease.
• Examples of primary prevention include:
o Immunization
o Wearing shoes to prevent hookworm
o Adequate intake of proteins and vitamins to prevent malnutrition
o Use of mosquito nets to prevent malaria
o Health education and promotion initiatives aimed at fostering positive health
behaviors
For example, promoting the use of latrines, promoting condom use, etc.
Secondary Prevention
• Secondary prevention: identifying/detecting individuals who are already infected with a
given disease as early as possible, in order to stop the disease from spreading/developing
further.
• Infected individuals should be diagnosed and treated as early as possible, to increase
recovery rates and reduce disability, morbidity, and mortality rates.
o Screening for early diagnosis and treatment can be done for sub-clinical diseases
using laboratory tests.
o Clinical examination can be done to discover early manifestation of disease, which is
easier to reverse.
o Treatment can be provided via drugs, lifestyle modification, or by natural remedies.
Tertiary Prevention
• Tertiary prevention: preventing further disability, or preventing/postponing death due to
disability or secondary disease(s).
o Because the disease is now established, primary prevention activities may have been
unsuccessful.
o Early detection through secondary prevention may have minimized the impact of the
disease.
o Sometimes, the agent has been removed while the disability or effect of the disease is
still visible or felt.
• Tertiary prevention may refer to patients who have been cured of a primary disease but
have a permanent disability (e.g. post-polio paralysis or post-trachoma blindness) and/or
need rehabilitation.
• Examples of tertiary prevention include:
o Palliative care for patients with AIDS or cancer
The Role and Characteristics of Infectious Agents in Disease Transmission
Components of Disease Transmission

• Transmission of diseases in humans requires the following components:
o An agent, capable of infecting a human
o A source (an infected host or reservoir of infection)
o A portal of exit from the source
o A suitable means of transmission
o A portal of entry into a new, susceptible host
• Agent: the etiological factor which must be present for the disease, disability or
pathological state to occur in a susceptible host.
o An agent may be defined as the presence, absence, excess or deficiency of a certain
factor.
Characteristics of Infectious Agents

• Resistance: Refers to the ability of the agent to survive under adverse environmental
conditions.
o Some agents are remarkably resistant, such as Mycobacterium tuberculosis, which can
survive in the presence of alcohol and acid, and Clostridium tetani.
o Others are extremely fragile, such as gonococci and influenza viruses. They cannot
survive for a long time in sunlight.
• Infectivity: The capacity of a micro-organism to enter a susceptible host.
o It can be expressed as the proportion of the susceptible population which is infected
by a particular organism.
o Experimentally it can be thought of as the minimum number of particles, pathogens or
agents necessary to cause infection in 50% of a group of hosts of the same species.
(This is also called the ‘infective dose 50%,’ or ID50.)
• Pathogenicity: Refers to the capacity of a micro-organism to cause overt disease in the
infected host.
o It can be expressed as the proportion of the infected population (to be established by
serological or other laboratory techniques) that develops overt disease.
• Antigenicity: Refers to the ability of the agent to induce antibody production in the host.
These antibodies may not necessarily be protective.
• Toxigenicity: Refers to the capacity of the agent to produce a toxin or poison that can
cause pathogenic effects on the host. The pathogenic effect of agents in diseases such as
botulism and shellfish poisoning depends on the toxin produced by the micro-organism
rather than on the direct effect of the micro-organism itself.
• Virulence: Refers to the severity of the disease. It is the degree of pathogenicity of an
infectious agent.
• One measure of virulence is the Case Fatality Rate (CFR) which can be expressed as
follows:
$$\text{Case Fatality Rate (CFR)} = \frac{\text{No. of persons dying of a disease during a stated time period}}{\text{Total no. of persons with the disease during the same time period}} \times 100$$
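The proportions defined above for infectivity and pathogenicity, and the Case Fatality Rate, can be worked through with a short Python sketch. The function names and the outbreak figures are hypothetical and serve only to show the arithmetic.

```python
def proportion(numerator, denominator):
    """Divide two counts, rejecting an empty denominator group."""
    if denominator <= 0:
        raise ValueError("denominator must be greater than zero")
    return numerator / denominator


def infectivity(infected, susceptible_exposed):
    """Proportion of the exposed susceptible population that becomes infected."""
    return proportion(infected, susceptible_exposed)


def pathogenicity(overt_cases, infected):
    """Proportion of the infected population that develops overt disease."""
    return proportion(overt_cases, infected)


def case_fatality_rate(deaths, cases):
    """Deaths among persons with the disease in the same period, per 100 cases."""
    return proportion(deaths, cases) * 100


# Hypothetical outbreak: 200 susceptible persons exposed, 80 infected,
# 40 with overt disease, 4 deaths among those 40 cases.
print(infectivity(80, 200))       # 0.4
print(pathogenicity(40, 80))      # 0.5
print(case_fatality_rate(4, 40))  # 10.0
```

Note that each measure uses a different denominator: exposed susceptibles for infectivity, infected persons for pathogenicity, and persons with the disease for the CFR.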
The Role of Source in Disease Transmission

• Source: An infected host or reservoir of infection. The infectiousness of a host differs
from disease to disease, but is usually consistent for any given disease.
o Many diseases are transmissible even during periods when the disease is not (yet)
clinically manifest.
For example, mumps can be transmitted during the incubation period, and
gonorrhoea is transmissible in asymptomatic cases.
o Some diseases can be transmitted years after recovery, as in a chronic carrier state (as
with typhoid and hepatitis B), or by reactivation of a latent infection (as with herpes).
• Reservoir of infectious agents refers to any human beings, animals, arthropods, plants,
soil or inanimate matter or a combination of these in which an infectious agent normally
lives and multiplies, and on which it primarily depends for survival and reproduction in
such a manner that it can be transmitted to a susceptible host.
o A reservoir of infection may also be termed as the ‘natural habitat’ of the infectious
agent.
Difference between Anthroponoses and Zoonoses

o Anthroponoses: Diseases in which humans are the only reservoirs (e.g. smallpox,
measles, cholera, etc.)
o Zoonoses: diseases which involve other animal reservoirs (e.g. plague, rabies).
The Portal of Exit as a Component of Disease Transmission
Portal of Exit in the Human Host

• Portals of exit in humans:
o Respiratory passages
o Alimentary canal
o Openings in the genito-urinary system
o Skin lesions
• Additional portals of exit may be made available through:
o Insect bites
o Drawing of blood
o Surgical procedures
o Accidents and other injuries
• For the chain of transmission to be continued, the portal of exit must be appropriate to
the particular agent.
• To produce infection:
o The agent must exit from the source in sufficient quantity
o Survive the environment and the defences surrounding the portals of entry in the new
host.
Suitable Means of Transmission
• Transmission of infectious agents is any mechanism by which a susceptible host is
exposed to an infectious agent. It may be either direct or indirect.
Direct Transmission
• Direct transmission means that the agent is transmitted directly from the infected host
(man or animal) to the new host (e.g. influenza, gonorrhoea, etc.) Direct transmission may
be horizontal or vertical.
o Examples of horizontal transmission of diseases include:
Droplet infection
Faeco-oral route
Genital
Direct skin contact
o Examples of vertical transmission of diseases include:
Transplacental
Genital tract
Indirect Transmission
• Indirect transmission requires a vehicle or vector to carry the agent of disease from one
host to another.
Vehicles
• Substances such as water, air, food, blood used in transfusion, or fomites (inanimate
objects used by an infectious host, such as clothing, handkerchiefs, doorknobs etc.)
• Multiplication may or may not take place in or on the vehicle (e.g. in influenza, hepatitis,
and streptococcal disease)
Vectors
• Vectors may be either mechanical or biological, but are always living.
• Mechanical vectors are animals and insects that carry agents from place to
place on their feet, proboscis, or other body parts.
o For example, flies are vectors of Shigellosis. Flies can breed in infected feces, and
then contaminate food, which humans may ingest.
• In biological vectors, the agent must grow or multiply within the
body of the vector.
o For example, fleas infected with Yersinia pestis bacteria transmit plague to humans
and other mammals through their bites.
The Role of Host as a Component of Disease Transmission

• Host refers to the organism capable of being infected by the agent.
• Obligate host
o An obligate parasite is one that is entirely dependent on its host for survival.
o If a human body is necessary for the life cycle or for the continued existence of an
agent, the human constitutes the obligate host for this agent.
o Humans are the obligate hosts for agents such as the malaria parasite (Plasmodium species).
• Incidental hosts (occasional or accidental)
o Hosts that are not usually involved in the natural cycle of transmission.
o Humans are incidental hosts for zoonoses such as plague and rabies.
Characteristics of the Host
• Infected
o An infected person harbours an infectious agent and has either manifest disease or an
asymptomatic infection.
o They may or may not be infectious to others.
• Infectious
o A person from whom the infectious agent can be naturally acquired.
o A person or their articles or clothing may also be merely contaminated with an
infectious agent, without being infected.
• Immune
o A person who possesses specific protective antibodies or cellular immunity as a result
of previous infection or immunisation.
o Immunity is relative: an ordinarily effective protection may be overwhelmed by an
excessive dose of the agent.
o It may also be impaired by immune-suppressive drug therapy or concurrent disease
(such as AIDS).
• Inherent resistance
o The ability to resist disease, independent of antibodies or specifically developed tissue
response.
o It commonly resides in anatomical or physiological characteristics of the host, and
may be genetic or acquired, permanent or temporary.
• Susceptible
o A person is considered susceptible if they do not possess sufficient resistance
(inherent and/or acquired) against a particular pathogenic agent to prevent contracting
a disease when exposed to the agent.
o Persons or animals must belong to a species that is biologically capable of being an
efficient host to the agent in question.
o Susceptibility of a host may be modified by characteristics, such as age, sex, race,
genetic make-up, physiological state, habits and customs, pathological state and
previous experience with the agent (immunity).
Take-Home Assignment: Natural History of Disease and Levels of Prevention
Instructions
Refer to Worksheet 11.1: Natural History of Disease and Levels of Prevention for
details on the steps for this assignment.
You will work in small groups to accomplish the assignment as instructed by the tutor.
Complete Option 1: Written Assignment or Option 2: In Class Presentation based on class
schedule and amount of time available.
Make sure you do the following:

• Outline the natural history of the disease/health issue that you are investigating.
• Identify a comprehensive list of actions that can be taken to prevent morbidity and mortality
at each level (primordial, primary, secondary, tertiary).
• Include a list of resources/references in assignment/presentation.
• Prepare a brief written paper or in-class group presentation (Option 1 or Option 2).
This is a take-home assignment and you need to submit your work before the next session.
Key Points
• Understanding the natural history of a disease, i.e. its course over time, is important
for determining its diagnosis, treatment and prevention.
• There are four levels of prevention of disease which all have interplay in the natural
history of disease: primordial, primary, secondary, and tertiary.
• Transmission of infectious agents, in which a susceptible host is exposed to an
infectious agent, can be direct or indirect.
Evaluation
• Define natural history of disease.
• What are the three stages of pathogenesis in natural history of disease?
• What are the four levels of prevention?
• Describe the six key characteristics of infectious agents in disease transmission.
References
• WHO. 2007. World health statistics 2007. Geneva, Switzerland: World Health
Organization. Retrieved from: http://www.who.int/whosis/whostat2007/en/ (date
unknown)
Worksheet 11.1: Natural History of Disease and Levels of
Prevention
Instructions
• The class will work in small groups.
• Each group will investigate a different topic. Groups should work together to:
o Outline the natural history of the disease/health issue that they are investigating
o Identify a comprehensive list of actions that can be taken to prevent morbidity and
mortality at each level (primordial, primary, secondary, tertiary).
o Include a list of resources/references in assignment/presentation.
o Allow 2-3 hours for group discussion and report/presentation preparation.
• The instructor will inform students to follow instructions for Option 1 or Option 2 below.
o Option 1: Written Assignment
Each group will prepare a brief written paper outlining the content above (natural
history of disease, and opportunities for prevention at all four levels).
Assignment will be handed in to the instructor for feedback and grading.
o Option 2: In Class Presentation
Each group will prepare a brief presentation detailing the content above (natural
history of disease, and opportunities for prevention at all four levels). All
resources for presentation (flipcharts, etc.) should be prepared in advance.
Instructor will inform students about the date of their group presentation, and
groups will present and discuss in class.
Group Assignments:
1. Road and Traffic Safety: High rate of road traffic accidents in Tanzania resulting in
injury and death to motorists and pedestrians
2. Infant Mortality: High infant mortality rate in Tanzania: 76 children per 1,000 live
births (2005) (Source: World Health Organization, 2007.)
3. Maternal Mortality: High maternal mortality ratio in Tanzania: 1500 women per
100,000 live births (2000) (Source: World Health Organization, 2007.)
4. Skilled Birth Attendance: In Tanzania, 47.1% of live births take place at a health
facility, and 52.7% take place at home. (Source: Tanzania Demographic and Health
Survey, 2004-5)
5. Child Health: Low coverage of primary health indicators, including number of children
under-five immunized, nutritional status of children under five, etc.
Session 12: Introduction to Epidemiological
Methods/Studies
Learning Objectives
• Describe key types of studies/methods in epidemiological research
• Describe types of analytical and descriptive surveys
• Specify the formulas for calculating relative risk and odds ratio
• Identify three methods of hypothesis formation
Introduction to Epidemiological Research Studies/Methods

Types of Epidemiological Studies
• Epidemiological studies are designed to determine the extent and distribution of diseases
and their determinants (causes) in human populations with the aim of identifying effective
management or preventive measures.
• Depending on the type of health problem to be investigated, a variety of study designs
have been developed. These can be grouped into two broad categories:
o Observational Studies
o Experimental Studies
• Observational studies involve only observational techniques to study a population,
whereas experimental studies employ interventions/manipulations and study their effects
within a population.
• Surveys are investigations in which information is systematically collected but in which
the experimental method is not used.
Refer to Handout 12.1: Types of Epidemiological Studies.
Types of Observational Studies

• There are two types of observational studies:
o Descriptive Studies
o Analytical Studies
Descriptive Studies
• Descriptive studies are useful in studying the natural history of diseases.
• Descriptive studies describe the distribution of disease and disease
determinants in human populations.
• They are generally used to describe exposure variables and patterns of disease occurrence
with all types of studies.
o They explain the what, who, when, and where of health events.
o They provide information about person, place and time.
Person/Who: What are the characteristics of the persons affected by the disease?
(age, sex, race, socioeconomic status, genetic constitution, immunological status,
etc.)
Place/Where: What are the geographic characteristics of individuals and groups
who are affected by a particular disease? (geographical placement, altitude,
latitude, climate, vegetation, proximity to another key location such as body of
water or factory, etc.)
Time/When: Does the disease have any time trend? Many infectious diseases
occur during certain periods of the year (seasonal distribution).
• Descriptive studies are sometimes called descriptive statistics.
o Data can be presented in the form of rates, frequency distributions, measures of
central location and dispersion, graphs, charts and maps.
• Descriptive studies can be time-bound, or can be ongoing.
o For example, disease registries in a particular area are not time-limited. They provide
an ongoing record of various characteristics of the affected individuals including age,
sex, occupation, duration of symptoms, etc.
o Cross-sectional studies are examples of time-limited descriptive studies.
• Key types of descriptive studies include:
o Ecological/Correlational Studies
o Case Reports/Case Series
o Cross-Sectional studies (which can be considered partially analytical)
Refer to Handout 12.2: Summary of Types of Descriptive Studies
Analytical Studies
• Analytical studies (or explanatory studies) try to explain a disease in context (i.e., provide
a situational analysis).
• These studies are designed specifically to explain the determinants (i.e., the how and the
why) of disease. They answer the following questions:
o Why does the disease occur in the persons experiencing it, and not in the persons not
experiencing it?
o Why do certain persons fail to make use of health services?
o Can the decreased incidence of the disease be attributed to the introduction of
preventive measures?
• To answer these questions, hypotheses are formulated and tested that may help to explain
the situation.
• Examples of analytical studies include:
o Ecological Studies
o Cross-sectional Studies
o Cohort Studies
o Case-Control Studies
• Note that there may be some overlap between the categories of Analytical Studies and
Descriptive Studies. Some types (ecological, cross-sectional) often fit into both
categories.
• Analytical studies can be done at the group or the individual level.

• In group surveys, populations are compared (e.g. groups from different regions). Group
information is obtained from selected regions or from the same region at different time
periods. For example:
o National heart disease rates and consumption of animal fat in a country/region
• Inferences from such surveys are usually regarded as hints, because they often suffer from
ecological fallacy.
o ‘Ecological fallacy’ is when a person believes that what they observe at a group level
also applies on an individual level. It is easy to draw weak or false conclusions, such
as the following examples:
Proportions affiliated to certain religions in a country, and suicide rates in a
country. (While a country with a higher proportion of Protestants may also have
higher suicide rates than a country with different religious composition, there is no
evidence that individual Protestants are more likely to commit suicide than a
member of any other religious group.)
Proportion of the population in mining occupations and lung cancer rates in the
country.
• Individual level surveys also survey groups; however, they utilize information from
individuals. Such surveys are performed to test hypotheses that a specific factor is related
to a specific disease.
o Cross-sectional studies have an unselected population (i.e., prevalence studies)
o Case-control and cohort studies require data referring to more than one point in time,
and hence are longitudinal/time-span studies.
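The ecological fallacy described above can be illustrated with a small sketch. The three regions and all counts below are invented for illustration only. At the group level, regions with a larger proportion of exposed people have higher disease rates, yet within every region the exposed and unexposed individuals have exactly the same risk.

```python
# Hypothetical data (invented for illustration), one tuple per region:
# (exposed diseased, exposed total, unexposed diseased, unexposed total)
regions = {
    "Region A": (2, 20, 8, 80),
    "Region B": (10, 50, 10, 50),
    "Region C": (24, 80, 6, 20),
}


def summarise(counts):
    """Return group-level and individual-level measures for one region."""
    exp_cases, exp_total, unexp_cases, unexp_total = counts
    total = exp_total + unexp_total
    return {
        # Group-level (ecological) measures
        "proportion_exposed": exp_total / total,
        "disease_rate": (exp_cases + unexp_cases) / total,
        # Individual-level measures within the region
        "risk_exposed": exp_cases / exp_total,
        "risk_unexposed": unexp_cases / unexp_total,
    }


for name, counts in regions.items():
    print(name, summarise(counts))
```

In this made-up data the disease rate rises from region to region together with the proportion exposed, but the risk among exposed and unexposed individuals is identical inside each region. Concluding that exposed individuals are at higher risk would therefore be an example of the ecological fallacy.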
Case Reports and Case Series

• Case reports describe the experience of a single patient or a group of patients with a
similar diagnosis.
• Case reports document unusual medical occurrences and can represent the first clue in the
identification of new disease or adverse effects of exposure (e.g. drugs/treatments).
• Case series are a collection of individual case reports, which may occur within a short
period of time.
o For example, Pneumocystis jirovecii pneumonia was initially described in young men
who have sex with men (MSM) who were previously healthy. Prior to this, it had only
been observed among elderly patients suffering from debilitating diseases like cancer.
Kaposi sarcoma was also newly observed among young, previously healthy
MSM.
These observations led to the idea that this was probably a new/emerging
disease.
Eventually, these observations led to the medical discovery of Human
Immunodeficiency Virus (HIV) and Acquired Immune Deficiency Syndrome
(AIDS).
• Limitations of case reports and case series
o Case reports and case series cannot test for the presence of valid statistical association
between disease and another factor.
o They are based on the experience of one individual or a few individuals, rather than a
representative group/population.
• Uses of case reports and case series
o Can help establish a foundation for hypotheses about disease causation, etc.
Ecological Studies
• These studies are often referred to as correlational studies.
• They frequently initiate the epidemiological process and allow for more detailed analysis
of observed correlations.
• Measures that represent characteristics of the entire population are used to describe
disease in relation to a factor of interest.
o Factors of interest may include per capita food consumption, per capita cigarette
consumption, infant mortality, mean annual rainfall, etc.
o For example, in one country a relationship was demonstrated between average sales
of an anti-asthma drug and the occurrence of an unusually high number of asthma
deaths.
• The two variables are correlated, and the measure of correlation is called the correlation
coefficient (r).
o For example, correlating per capita cigarette smoking and the occurrence of lung
cancer.
• The correlation coefficient (r) quantifies the extent to which there is linear relationship
between exposure and disease.
o It ranges from -1 to +1.
• The units of analysis are groups of people rather than individuals.
• Such relationships may be studied by comparing populations in different countries at the
same time, or the same population in one country at different times. (The latter approach
may avoid some socio-economic confounding.)
• Although simple to conduct and thus attractive, ecological studies are often difficult to
interpret since they usually rely on data collected for other purposes, and essential
exposure data may not be available.
• Because the unit of analysis is a population or group, the individual link between
exposure and effect cannot be made.
• One attraction of ecological studies is that data can be used from populations with widely
differing characteristics.
o For example, correlation between esophageal cancer rates in communities with
different patterns of salt consumption.
• Ecological fallacy or bias may result from ecological studies if one erroneously infers that
relationships established between two or more variables, measured at an aggregate level,
will also hold at the individual level.
• Advantages of correlational studies:
o They can be undertaken rapidly
Often, these studies use routine data that is already available.
o They are inexpensive to conduct
• Limitations of correlational studies:
o Unable to link exposure with disease in particular individuals
For example, a correlational study cannot prove a link between increasing pap
smear rates among women and decreasing mortality from cancer of the cervix. In
this case it is not easy to prove that those women who underwent pap smears are
the same who experienced a reduction in cancer mortality.
o Inability to control for the effects of potential confounding factors.
For example, it has been shown that there is strong correlation between per
capita number of colour television sets and Coronary Heart Disease (CHD) in
various countries. Owning a colour television is an indicator of economic well-
being, and related to other factors which are more likely to increase the risk of
CHD than owning a colour television.
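The correlation coefficient (r) used in ecological studies can be computed directly from group-level data. The following is a minimal sketch of the standard Pearson formula. The five "countries" and their figures are invented for illustration, and the function name `pearson_r` is our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical group-level data: one value per country, not per person.
cigarettes_per_capita = [500, 1000, 1500, 2000, 2500]
lung_cancer_deaths_per_100000 = [8, 15, 19, 30, 33]

r = pearson_r(cigarettes_per_capita, lung_cancer_deaths_per_100000)
print(round(r, 2))
```

A value of r close to +1 indicates a strong positive linear relationship between the two group-level measures. It does not show that the individuals who smoke are the ones who develop lung cancer, and it does not account for confounding factors.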
Cross-Sectional Studies
• Cross-sectional studies or prevalence studies are carried out at a certain point in time and
in a given population or geographical area.
• They depend on a single examination of a cross-section of the population in which sick
and healthy, or exposed and unexposed, are not distinguished until results are examined.
• Information is collected through surveys/questionnaires, and/or laboratory or physical
examination of individual members of the study population.
• The prevalence of a risk factor or a disease is expressed as the proportion of affected
individuals in the study population in a given geographical area at a given point in
time.
• From a well-defined population, disease status and exposure are assessed simultaneously.
• The point in time could be:
o Calendar year (mid-year, mid-month)
o A fixed point in the course of events
This varies in real time from person to person. Examples include menarche,
adolescence, military recruitment, school age, etc.
• An important limitation of cross-sectional studies is their inability to sort out cause and
effect relationships since both are found in the study population at the same time.
• Cross-sectional studies can be used for the following purposes:
o To determine the magnitude of disease or disease determinants in a community in
terms of their prevalence.
o To study preliminary associations between disease and possible aetiological factors by
comparing the characteristics of the sick with those of the healthy.
o To screen undiagnosed disease in a community.
• Limitations of cross-sectional studies:
o Not possible to determine whether the exposure preceded or resulted from the disease.
o Reflects determinants of survival as well as aetiology.
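As a worked illustration of the prevalence proportion used in cross-sectional studies, assume a hypothetical survey (the figures are invented for illustration only) that examines 2,000 residents on a single day and finds 150 with the condition:

```latex
\[
\text{Prevalence}
  = \frac{\text{number of affected individuals at that point in time}}
         {\text{number of individuals in the study population}}
  = \frac{150}{2000} = 0.075 = 7.5\%
\]
```

Because disease status and exposure are recorded at the same moment, this figure says how common the condition is, but not whether any exposure came before it.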
Stages in the design of cross-sectional studies

• Specify the aim of study (relevant to disease control or management).
• Define the study population, including sample size.
o Sample size should be adequate for valid estimates.
• Conduct sampling, recruit participants, and manage the sample in order to achieve a high
response rate (which is important for valid estimates).
• Conduct examinations and interviews, and retrieve records.
• Handle and manage data.
• Analyze, interpret results, and write reports.
Cohort and Case-Control Studies
Refer to:
• Handout 12.3: Summary of Case-Control and Cohort Studies
• Handout 12.4: Measures of Effect in Cohort and Case-Control Studies
Note: students will have additional practice calculating risk ratios and odds ratios in later
sessions.
Cohort Studies
• Cohort studies are also known as prospective studies, longitudinal studies, follow-up
studies or incidence studies.
• They can be carried out prospectively or retrospectively (i.e., historically).
• These studies are carried out on a sample of the population to determine the rate at which
groups of the population develop disease or die from it when differentially exposed. This
is one way of testing a hypothesis in disease causation.
• Basic data from cohort studies are represented in Figure 1:
Figure 1: Exposure Status Among Diseased and Non-Diseased

                   Disease status
Exposure status    Diseased    Healthy    Total      Incidence
Exposed            a           b          a+b        a/(a+b)
Not exposed        c           d          c+d        c/(c+d)
Total              a+c         b+d        a+b+c+d
A Measure of Effect of Exposure: Relative Risk (RR)

• Relative Risk: the risk of developing a disease relative to exposure.
o Relative risk is a ratio of the probability of the event occurring in the exposed group
versus a non-exposed group.
o Also called ‘Risk Ratio’
• The incidence of disease among the exposed, a/(a+b), is divided by the incidence of disease
among the unexposed, c/(c+d):
$$\text{Relative Risk} = \frac{a/(a+b)}{c/(c+d)}$$
• Association is said to exist between exposure and development of disease if the measure is
significantly different from unity.
• Standard statistical techniques are available to test this difference.
• Although cohort studies may take a long time to complete, they have the advantage that they
are more reliable in providing evidence for causation than other analytical studies.
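The relative risk calculation from the cohort table in Figure 1 can be sketched in a few lines. The function name `relative_risk` and the counts used are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a cohort 2-by-2 table laid out as in Figure 1.

    a: exposed and diseased      b: exposed and healthy
    c: unexposed and diseased    d: unexposed and healthy
    """
    incidence_exposed = a / (a + b)
    incidence_unexposed = c / (c + d)
    return incidence_exposed / incidence_unexposed


# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop the disease.
rr = relative_risk(a=30, b=70, c=10, d=90)
print(round(rr, 2))
```

In this invented cohort the incidence among the exposed (0.30) is three times the incidence among the unexposed (0.10). A relative risk of 1 would mean the two incidences are equal.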
Case-Control Studies
• Also called case-referent studies.
• These studies involve the comparison of cases of the disease under study with
comparable controls for levels of exposure.
• The effect of exposure in such studies is measured by the Odds Ratio (OR = ad/bc), as
shown in Figure 2. The OR approximates the Relative Risk (RR) when the disease is rare.
• Case-control studies are by far the simplest in determining cause and effect relationships.
• They take a short time to complete and have the advantage that they use cases of the
disease under study and are especially useful for studying rare diseases.
• Important disadvantages of case-control studies are that they are more susceptible to
selection bias and that information on exposure is less accurately ascertained than in
cohort studies.
Figure 2: Exposure Status Among Cases and Controls
                   Disease status
Exposure status    Cases (diseased)    Controls (healthy)    Total
Exposed            a                   b                     a+b
Not exposed        c                   d                     c+d
Total              a+c                 b+d                   a+b+c+d
A Measure of Effect Among Cases and Controls: Odds Ratio (OR)

• Odds Ratio: the ratio of the odds of being exposed in the group with the disease to the
odds of being exposed in the group without the disease. It compares the odds of exposure
in the two groups.
• This is obtained by dividing the odds of exposure among the cases (a/c) by that among the
controls (b/d):
$$\text{Odds Ratio} = \frac{a/c}{b/d} = \frac{a \times d}{b \times c}$$
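The odds ratio from the case-control table in Figure 2 can be sketched as follows. The function name `odds_ratio` and the counts are hypothetical. The sketch assumes no cell is zero, otherwise the division fails.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a case-control 2-by-2 table laid out as in Figure 2.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    odds_of_exposure_in_cases = a / c
    odds_of_exposure_in_controls = b / d
    return odds_of_exposure_in_cases / odds_of_exposure_in_controls


# Hypothetical study: 100 cases (40 exposed) and 100 controls (20 exposed).
or_value = odds_ratio(a=40, b=20, c=60, d=80)
cross_product = (40 * 80) / (20 * 60)  # the equivalent ad/bc form
print(round(or_value, 2), round(cross_product, 2))
```

Both forms give the same value, which is why the cross-product ad/bc is usually the quicker calculation by hand.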
Time Considerations in Types of Studies

• Studies can be cross-sectional or longitudinal.
• Cross sectional surveys provide information concerning the situation that exists at a
given time (e.g. weights of children at a given time).
• Longitudinal surveys provide data concerning events during a period of time (e.g.
weights of children as they grow older, each child being weighed more than once during
the study period).
• In determining the natural history of a disease (e.g. tuberculosis or AIDS), individuals are
followed up through acquisition of the disease and appearance of signs and symptoms to
recovery, chronicity, development of various complications or death.
Experimental Studies
• An experimental study is an investigation in which the researcher wishes to study the
effects of exposure to or deprivation of a defined factor, and designs a situation in which
subjects (persons, animals, communities, etc.) will be exposed to or deprived of the factor.
• If the investigator compares subjects exposed to the factor with subjects not exposed to it,
the study is a controlled experiment.
• Common experimental studies include:
o Intervention studies
o Clinical trials
o Prophylactic trials
Other Types of Research Studies
Refer to Handout 12.5: Additional Study Types and Terminology for a list of
common types of research conducted in the healthcare sector that are not epidemiological
investigations.
Methods of Hypothesis Formulation
• There are three methods of hypothesis formulation.
o Method of Difference
If the disease frequency is significantly different between two sets of
circumstances, the disease might have a causal association with a particular factor
that differs between the two.
o Method of Agreement
If a single factor is common in a number of circumstances in which the disease
occurs, causal association can be suspected.
o Method of Concomitant Variation
If a factor varies in proportion to the frequency of disease, causal association
can also be suspected.
Key Points
• The main categories of epidemiological studies are observational studies and
experimental studies.
• The aim of descriptive studies is to generate ideas or hypotheses for association(s)
between risk factors and diseases, while analytical studies use a comparison group to
establish an association.
• Cross-sectional studies are important in determining disease prevalence, while cohort studies
are useful in determining disease incidence.
• There are three methods of hypothesis formulation:
o Method of agreement
o Method of difference
o Method of concomitant variation
Evaluation
• What are the main categories of epidemiological studies?
• Name two types of analytical studies.
• What is the formula for calculating odds ratio in case control studies?
• What are the three methods of hypothesis formation?
References
• Freedman, D. (1999). Ecological Inference and the Ecological Fallacy. Retrieved June
18, 2010 from http://www.stanford.edu/class/ed260/freedman549.pdf.
• Kapiga, S. et al. (1998). Lecture notes in epidemiology and research methodology.
• MacMahon B., Trichopoulos, D. (1996). Epidemiology: principles and methods. Boston:
Little, Brown, & Co.
• Rosner, B. (2006). Fundamentals of Biostatistics. (6th edition). Belmont, CA: Thomson
Brooks/Cole.
Source: Adapted from Kapiga S., et al. (1998).
Handout 12.1 Types of Epidemiological Studies
Note: The above figure is a very simplified one and is meant to assist understanding the possible relationships
in epidemiological methods. In practice, there will be overlaps between certain methods.
Handout 12.2: Summary of Types of Descriptive Studies
Descriptive Studies
• Populations
o Ecological studies
• Individuals
o Cross-sectional studies
o Case reports
o Case series
Handout 12.3: Summary of Case-Control and Cohort Studies
Kapiga S.H. et al. (1998). LNER. MUCHS.
Handout 12.4: Measures of Effect in Cohort and Case-Control
Studies
Cohort Studies and Relative Risk

• Relative Risk: the risk of developing a disease relative to exposure.
o Relative risk is a ratio of the probability of the event occurring in the exposed group
versus a non-exposed group.
o Also called ‘Risk Ratio’
• The incidence of disease among the exposed, a/(a+b), is divided by the incidence of disease
among the unexposed, c/(c+d).
Figure 1: Exposure Status Among Diseased and Non-Diseased
                   Disease status
Exposure status    Diseased    Healthy    Total      Incidence
Exposed            a           b          a+b        a/(a+b)
Not exposed        c           d          c+d        c/(c+d)
Total              a+c         b+d        a+b+c+d
$$\text{Relative Risk} = \frac{a/(a+b)}{c/(c+d)}$$
• Association is said to exist between exposure and development of disease if the measure is
significantly different from unity.
• Standard statistical techniques are available to test this difference.
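One standard technique for testing whether the association in a 2-by-2 table differs from what chance alone would produce is the chi-square test. The sketch below uses the shortcut formula for a 2-by-2 table without a continuity correction. The counts are hypothetical, and 3.84 is the usual critical value for 1 degree of freedom at the 5% significance level.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2-by-2 table (no continuity correction)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


CRITICAL_VALUE_5_PERCENT_DF1 = 3.84  # chi-square critical value, 1 degree of freedom

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop the disease.
statistic = chi_square_2x2(a=30, b=70, c=10, d=90)
is_significant = statistic > CRITICAL_VALUE_5_PERCENT_DF1
print(statistic, is_significant)
```

If the statistic exceeds the critical value, the difference between the two incidences is unlikely to be due to chance alone at the 5% level. The test says nothing about bias or confounding, which must be assessed separately.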
Case-Control Studies and Odds Ratio

• Odds Ratio: the ratio of the odds of being exposed in the group with the disease to the
odds of being exposed in the group without the disease. It compares the odds of exposure
in the two groups.
• This is obtained by dividing the odds of exposure among the cases (a/c) by that among the
controls (b/d).
Figure 2: Exposure Status Among Cases and Controls
                   Disease status
Exposure status    Cases (diseased)    Controls (healthy)    Total
Exposed            a                   b                     a+b
Not exposed        c                   d                     c+d
$$\text{Odds Ratio} = \frac{a/c}{b/d} = \frac{a \times d}{b \times c}$$
• Interpreting Odds Ratios:
o OR = 1: There is no association between exposure and disease.
o OR ≠ 1: There may be an association between exposure and disease.
o OR > 1: Exposure is associated with an increased risk of disease.
o OR < 1: Exposure is associated with a reduced risk of disease, which suggests a protective
effect of the exposure under investigation.
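As a worked illustration of an odds ratio below 1, assume a hypothetical case-control study (counts invented for illustration only) with a = 15 exposed cases, b = 45 exposed controls, c = 35 unexposed cases and d = 25 unexposed controls:

```latex
\[
\text{Odds Ratio} = \frac{a \times d}{b \times c}
  = \frac{15 \times 25}{45 \times 35}
  = \frac{375}{1575} \approx 0.24
\]
```

Because the result is less than 1, the exposure in this made-up example would be read as protective, provided the difference from 1 is statistically significant.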
Handout 12.5: Additional Study Types and Terminology
• Evaluative Studies
o Carried out to appraise the value or quality of healthcare or health programs
• Programme review
o A programme is any enterprise organized to eliminate or reduce one or more
problems.
o A programme review evaluates the care given to specific patients, communities, or
populations, or may evaluate a particular programme that operates in a defined setting
with specified aims and goals.
o Programme examples include immunization coverage in an area, iodization of salt for
goitre prevention, fluoridation of water supplies for the prevention of dental caries,
etc.
• In-built evaluation
o An evaluation that is planned in advance (for example, during the early programme
planning phase) and the required evaluation data is collected in a systematic way as an
integral part of the provision of service.
• Medical audit
o A type of evaluation in which the quality of service is evaluated by appraising the
quality of care given to individuals. If a medical audit discovers that some useful
procedure is not being done in the course of patient care, then it recommends that it
should be done (e.g., testing of blood for Hepatitis B virus before transfusion).
• Surveillance
o A systematic collection, analysis, and use of information for the control of a specific
disease. Generally it is used to observe the ongoing health status of a
community/population.
• Pilot study
o A dress rehearsal of an investigation performed in order to identify defects in the
study design.
• KAP studies
o Studies of Knowledge, Attitudes, and Practices (KAP) towards healthcare or health
behaviours. These are important in health education methods.
• Operational research
o Research concerned with organizational problems that seeks to determine how best a
service can be provided given all the possible constraints.
o For example, we know vaccination against measles is effective and could eradicate
measles, but there are many constraints, including:
Cold chain maintenance problems
Faulty immunization techniques
Lack of proper supervision
Inadequate facilities and resources
Lack of staff motivation
Lack of community motivation
o Operational research determines which one of these constraints is most important in
the control of measles.
Session 13: Case-Control Studies
Learning Objectives
• Describe case-control studies
• Explain advantages and disadvantages of case-control studies
• Describe retrospective and prospective case-control designs
• Discuss measures of effect from a case-control study
• Calculate an odds ratio from a 2-by-2 table
Introduction to Case-Control Studies

• Case-control studies are also called case-referent, case-comparison or case-history
studies.
• A case-control study is an epidemiologic investigation which involves comparing the
characteristics of diseased persons (the cases) with those of non-diseased persons (the
controls).
• The purpose of this comparison is to identify factors which occur more (or less)
frequently in the cases as compared to the controls, and hence provide clues regarding the
role of these factors in elevating (or reducing) the risk of the disease under investigation.
Advantages of Case-Control Studies

• Efficient in time and cost.
• Efficient for the study of rare diseases.
• Efficient for the study of chronic diseases.
• Tends to require a smaller sample size as opposed to other designs.
• Allows exploration of a large number of exposures for the disease under investigation.
• May be completed more rapidly than alternative designs.
• Ethical problems are minimal.
• Subjects need not be volunteers.
• Attrition problems are minimal.
• Used to test the hypothesis that disease in the cases is or is not associated with
exposure to some factor. Hence, case-control studies can provide suggestive evidence
of a causal relationship that warrants public health intervention to reduce exposure to
the risk factor.
Disadvantages of Case-Control Studies

• Both the exposure and disease have already occurred at the time the participants are
recruited into the study. Therefore, the design is susceptible to:
o Bias arising from differential selection of cases or controls into the study on the
basis of their exposure status.
o Bias arising from differential reporting or recording of exposure information among
study groups based on their disease status.
• Inefficient in ascertainment of rare exposures.
• Rates cannot be computed directly unless a modified design is adopted.
• Temporal relationships between exposure and disease are usually difficult to ascertain
Types of Case-Control Study Designs
• Retrospective Case-Control Design
• Prospective Case-Control Design
Retrospective Design
• All the cases have been diagnosed by the time the investigator initiates the study.
• They use existing (prevalent) cases.
Disadvantages of Retrospective Design

• Liable to sampling bias, i.e. a systematic difference between the study population and the
target population, which poses a threat to external validity.
o Patients who die or recover rapidly, as well as those with mild symptoms whose
illness goes undetected are not represented in the study sample.
o As a result, only survivors are studied. These survivors are unlikely to be
representative of the total patient population.
• Temporal relationship between exposure and disease is difficult to establish.
Prospective Design
• Once the study has begun, all new cases diagnosed within a specified period of time are
included in the study.
• This design tries to avoid studying survivors and hence addresses aetiological factors
as close as possible to the commencement of the disease process.
Design and Conduct of Case-Control Studies
Selecting Cases
• Cases should, as far as possible, represent a homogeneous disease entity.
o Establish strict diagnostic criteria for the disease.
For example, study cervical cancer or cancer of the body of the uterus, and not
uterine cancer (which includes both types).
o Depending on the certainty of the diagnosis and the amount of information available it
is often useful to perform the analysis and present the results separately for cases
classified as definite, probable or possible.
• Selection of cases should be done from a well-defined population called source
population.
• Possible sources of cases are:
o Hospital or healthcare facility
Commonly involves identifying people with disease who have been treated at a
particular health facility during a specified period of time.
Although such cases may be identified easily, the underlying source population
may not be well defined.
o General Population
All diseased individuals or a random sample from a defined population.
Avoids bias from selection factors that led the affected individual to utilise a
certain health care facility.
Selecting Controls
• A control is as much like a case as possible except that they do not have the disease
(outcome) in question.
• They must have the same opportunity for exposure as a case and must be subject to the
same inclusion and exclusion criteria.
• No one control group is optimal for all situations.
• Scientific, economic and practical considerations should be weighed before selecting
controls.
Sources of Controls
• Hospital/Healthcare Facility
o These may be patients admitted at the same hospitals as the cases for conditions other
than the disease being studied.
o Advantages of Selecting Controls from Hospital/Health Care Facility
They are easily identified.
They are readily available and minimize assembling costs.
Minimizes the potential for recall bias between cases and controls.
Provides for similar hospital selection factors for cases and controls.
More likely to co-operate than healthy individuals, hence minimizing non-response.
o Disadvantages of Selecting Controls from Hospital/Health Care Facility
They are ill by definition and differ from the healthy in a number of ways that may
be associated with the exposure.
The experience of these patients may not accurately represent the exposure
distribution in the population from which the cases are derived.
• General Population
o This is the source to be used when the cases have been selected to represent affected
individuals in a defined general population and also when hospital based controls are
not scientifically desirable or feasible.
o Recruitment may be done by random sampling methods using the available sampling
frame (administrative), selection of individuals from population registers, voting lists,
census records, salary lists, etc.
o Though controls from the general population may represent the non-ill individuals in
the community they have some limitations:
Usually costly and time consuming.
Sampling frame may not be available.
Difficult to get hold of busy people with a lot of scheduled activities.
May not recall exposures at the same level as the cases.
Individuals who have not experienced an adverse health effect are less motivated
to participate, hence non-response may be higher than among hospital-based controls.
• Special Groups
o These include friends, relatives, spouses and neighbours of the cases.
o Advantages of Using Special Groups for Controls
They are healthy like other members of the population.
Likely to be co-operative due to the interest they have in the health of the case.
May offer some degree of control of confounders such as ethnic background,
socio-economic status, environmental factors, etc.
o Disadvantages of Using Special Groups for Controls
Due to the closeness of the controls to the cases, distribution of the factor under
study may be similar in the cases as in the controls, and hence an underestimate
of the true effect of the exposure may occur.
How Many Controls should be selected per Case?
• The precision of the study can be improved by increasing the number of study subjects.
• In a case-control study, the limiting factor is usually the number of available cases.
Controls may be easier to find, so we can increase the number of controls. This will
increase the power of the study, but not infinitely.
• Going from one to two controls per case vastly increases the power. Above a ratio of 4 or
5 to 1, the gain becomes very small. It is usually not worthwhile to have a ratio this large,
unless the information is already collected and easily accessible (for instance in a
computer database/file).
• There is a trade-off between the need for higher study power and the cost of finding
more cases and controls.
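The diminishing return from extra controls can be illustrated with a commonly quoted rule of thumb that is not part of this manual: with r controls per case, the statistical efficiency relative to an unlimited supply of controls is roughly r / (r + 1). The sketch below assumes that approximation, and the function name is illustrative only.

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency of r controls per case, relative to an
    unlimited number of controls (assumed rule of thumb: r / (r + 1))."""
    r = controls_per_case
    if r <= 0:
        raise ValueError("controls_per_case must be positive")
    return r / (r + 1)


# Show how little is gained beyond 4 or 5 controls per case.
for r in range(1, 7):
    print(f"{r} control(s) per case: {relative_efficiency(r):.0%}")
```

Under this assumption the step from one to two controls adds about 17 percentage points, while the step from five to six adds about 3.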
Ascertainment of Exposure Status

• After the cases and controls have been defined in terms of characteristics and sources, the
information on exposure status must be obtained. This may be done by interview, mail
questionnaire, etc.
• Procedures must be similar for cases and controls.
• Interview settings should be the same.
• Interviewer bias should be minimized.
o When possible the interviewer should not know which subject is a case and which is a
control.
o Interviewers should not know the hypothesis under investigation.
Major Steps in Case-Control Study

• Define and select cases
• Select controls
• Ascertain exposures
• Compare exposure in cases and controls by obtaining proportions or odds ratios.
• Test any differences for statistical significance.
Measures of Effect from a Case-Control Study

• Results from a case-control study can only provide an estimate of the odds ratio and this
can be done under the following assumptions.
o The controls are representative of the general population with respect to the frequency
of exposure.
o The assembled cases are representative of all the cases of the disease.
o Frequency of the disease in the population is small.
• Under these assumptions the Odds Ratio (OR) can be used to estimate the relative risk
and can be calculated as the ratio of odds of exposure among the cases to that among the
controls (for unmatched designs).
Figure 1: Exposure Status among Cases and Controls

                   Disease status
Exposure status    Cases (diseased)    Controls (healthy)    Total
Exposed            a                   b                     a+b
Not exposed        c                   d                     c+d

Odds Ratio = (a/c) / (b/d) = (a × d) / (b × c)
• In interpreting results from case control studies and indeed from any other
epidemiological study, the following should always be considered:
o Bias should be minimized by adopting a good design.
o Chance should be evaluated by statistical methods.
o Confounding may be minimized by adopting an adequate design and adjusting for it in the
analysis.
• When the above are taken care of, the odds ratio may be interpreted as follows:
o OR = 1
An odds ratio of one means lack of association between exposure and disease.
o OR ≠ 1
Odds ratios different from one indicate the possibility of an association between
exposure and disease.
o OR > 1
If the odds ratio is greater than one, exposure to the factor is associated with an
increased risk of disease.
o OR < 1
If the odds ratio is less than one it shows a protective effect of the exposure under
investigation.
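The calculation and interpretation above can be sketched in a few lines of Python. This is a minimal illustration for an unmatched 2 x 2 table; the function names and the counts are hypothetical and are not taken from the worksheet.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for an unmatched case-control 2 x 2 table.

    a = exposed cases,    b = exposed controls,
    c = unexposed cases,  d = unexposed controls.
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


def interpret(or_value):
    """Plain-language reading of an odds ratio (ignores chance, bias, confounding)."""
    if or_value > 1:
        return "exposure associated with increased risk"
    if or_value < 1:
        return "exposure associated with decreased risk (protective)"
    return "no association"


# Hypothetical counts: 30 exposed cases, 10 exposed controls,
# 20 unexposed cases, 40 unexposed controls.
value = odds_ratio(30, 10, 20, 40)
print(value, "-", interpret(value))  # 6.0 - exposure associated with increased risk
```

The interpretation helper only reads the point estimate; whether the difference from one is statistically significant still has to be tested separately.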
Instructions
You will work in small groups to calculate and interpret Odds Ratio (OR). Record your
answers. Prepare to share your answers in plenary.
Refer to:
• Worksheet 13.1: Calculating Odds Ratio
Key Points
• A case-control study is an epidemiologic investigation which involves comparing the
characteristics of diseased persons (the cases) with those of non-diseased persons (the
controls).
• The purpose of this comparison is to identify factors which occur more (or less)
frequently in the cases as compared to the controls.
• A case-control study provides clues regarding the role of the factors in elevating or
reducing the risk of the disease under investigation.
Evaluation
• What are the advantages of case control studies?
• What are the disadvantages of case control studies?
• What are the advantages of selecting controls from hospital or health facility?
• What are the disadvantages of selecting controls from hospital or health facility?
• What are the advantages of using special groups for controls?
• How is the Odds Ratio interpreted?
Worksheet 13.1: Calculating Odds Ratio
Instructions
• Work together in small groups to calculate and interpret the Odds Ratio (OR) for each
problem below.
• Note that all data in the following problems are hypothetical, and the information below is
intended only to help illustrate how to calculate Odds Ratios.
• Refer to Handout 12.4: Measures of Effect in Cohort and Case-Control Studies as
needed (from previous Session)
Problem 1
• Calculate Odds Ratio using the following data.
• Interpret the result (strength of association)
Distribution of Cases and Controls According to Bottled Water Consumption and
Occurrence of Diarrhoeal Disease in a Case-Control Study

               Case    Control    Total
Exposed        20      5          25
Not exposed    5       20         25
Total          25      25         50
Problem 2
Distribution of Cases and Controls According to Bottled Water Consumption and
Occurrence of Diarrhoeal Disease in a Case-Control Study

               Case    Control    Total
Exposed        10      20         30
Not exposed    5       10         15
Total          15      30         45
Problem 3

               Case    Control    Total
Exposed        10      15         25
Not exposed    20      15         35
Total          30      30         60
Problem 4

               Case    Control    Total
Exposed        30      5          35
Not exposed    5       30         35
Total          35      35         70
Problem 5
               Case    Control    Total
Exposed        15      20         35
Not exposed    20      15         35
Total          35      35         70
Session 14: Cohort Studies
Learning Objectives
• Describe cohort studies
• Explain advantages and disadvantages of cohort studies
• Discuss measures of effect from a cohort study
• Compare case-control studies and cohort studies
Introduction to Cohort Studies

• The term ‘cohort’ was originally a Roman military term: a cohort was one tenth of a
legion.
• The word ‘cohort’ has come to be used in epidemiology to designate a group of individuals
sharing similar experiences, which we observe in order to learn about the occurrence of disease.
• These studies involve assembling healthy cohorts which are followed forward in time for
development of disease.
• A cohort is the group of persons who share a common experience within a defined time
period. For example:
o A birth cohort consists of all persons born within a given period of time.
o A marriage cohort would consist of all persons married within a certain period of
time.
o Groups of individuals are defined on the basis of presence or absence of exposure to a
suspected risk factor for a disease.
• At the time exposure status is defined, all potential subjects must be free from the disease
under investigation and eligible participants are then followed over a period of time to
assess the occurrence of that outcome.
• Below is a list of hypothetical studies which illustrate the main characteristics of cohort
studies:
o Example 1: Alcohol and Coronary Heart Disease (CHD)
Identify the study population and then interview to obtain information about alcohol
consumption. Classify people into alcohol drinkers and non-alcohol drinkers.
Follow them up for a period of time (e.g., 5 years), identify those who develop
CHD in both groups and compare the incidence rates.
o Example 2: High-risk Behaviour and HIV infection
Identify a high-risk behaviour that is of interest to investigators (e.g., multiple sex
partners, infection with STIs, inconsistent condom use, etc.). After identifying the
study population, classify people into either a ‘high risk’ group or a ‘low risk’ group.
Then, follow these groups for a specific time period. At the end of follow-up,
identify those who have become HIV-infected in each group and compare the
incidence rates.
o Example 3: Use of Oral Contraceptives (OC) and Obesity
To investigate the association between use of OC and obesity, investigators select
users of OC and comparable non-users of OC and invite them to participate in
the study. Consenting subjects are weighed at enrollment and then yearly for 3
years. At the end of the study the proportions of subjects found to be
obese in the two groups can be compared.
Key Characteristics of Cohort Studies
• Start with ‘healthy’ subjects
o That is, subjects who are free of the disease of interest at the start of the study.
• Classify them on the basis of exposure or risk factor.
• Follow subjects and assess the occurrence of outcome of interest among exposure groups.
• Compare incidence of outcome in both groups.
Selection of Cohorts for Study
Issues for Consideration in Assembling the Cohort

• The group of individuals selected to form a cohort can come from a variety of sources.
• The choice of a particular group will depend on the following scientific and feasibility
considerations.
Frequency of Exposures under Study

• For relatively common exposures (e.g. cigarette smoking or coffee drinking) a
sufficiently large number of exposed individuals could probably be identified from a
number of possible populations.
• For rare exposures, however, such as those related to particular occupation in specific
locations, it is more efficient to choose a group specifically because they have undergone
some unusual exposure or experience, the effects of which are to be evaluated.
• Such groups make it possible to obtain a sufficient number of exposed subjects within the
shortest possible time.
• The key issue is to have variability of exposure with sufficient numbers in various
exposure groups
• Accessibility of cohort members for measurements and follow-up is critical because of
the need to obtain complete and accurate information. Therefore, studies should be
conducted in populations where loss to follow-up is likely to be minimum.
• Incomplete follow-up affects validity of the study.
• The cohort selected should assure sufficient number of outcomes.
• An attempt should be made to make the results of the study generalizable.
• This requires assembling of the cohort which is representative of the population where the
study is being conducted.
Selection of Comparison Group

• Once the source of exposed subjects has been determined, the next consideration is the
selection of an appropriate group of non-exposed individuals.
• The major principle underlying this decision is that the groups being compared should be
as similar as possible with respect to all other factors that may be related to the disease
except the exposure under investigation.
• This means we expect to observe the same disease rates in the two exposure groups if
there is no real association between exposure and disease.
• When several risk factors (or exposures) are being considered simultaneously, the non-
exposed group should be defined as those with none of the risk factors (or exposures)
under evaluation.
• In many cohort studies, it may be useful to have multiple comparison groups, especially
when no single group appears sufficiently similar to those who are exposed to provide
assurance about the validity of the comparison.
o For Example: In evaluating potential adverse health outcomes associated with the use
of Oral Contraceptives (OC), choice of an appropriate comparison group is of critical
importance.
Some studies have used a comparison group of women not using OC, while others
have selected a group using some form of contraceptive other than OC.
Women not using contraception may differ substantially from users of any type of
contraception in terms of ability or desire to become pregnant or the nature of
their sexual practices.
On the other hand, women using various forms of contraceptives are likely to
differ from OC users with respect to religion, socioeconomic status, and other
lifestyle factors.
o Thus, it may be that no one comparison group is clearly superior and information on
the role of OC in disease can best be obtained by comparing the results from
various comparison groups.
• Consistent results from multiple comparison groups reinforce the belief that the
exposure under observation is related to the outcome or disease under investigation.
Selection of Subjects
These can be selected based on the following options:
• Random samples of populations
o Not useful when exposure is rare
• Special exposure groups (e.g. atomic bomb survivors)
o Groups with high exposure prevalence
o Useful for occupational or environmental exposures
• Within special information groups (e.g. doctors, students)
o Groups with high anticipated quality of information
Measures of Effect in a Cohort Study

Data Analysis
• Basic analysis of data from a cohort study involves the calculation of rates of the
incidence of a specified outcome among the cohorts under investigation.
• The basic layout of data at the end of the study is in the form of a 2 x 2 table as shown
below.
• Key measures include cumulative incidence and relative risk.
• Cumulative Incidence: The probability that a particular event (i.e., occurrence of a
disease) has occurred before a given time. It is a measure of disease frequency.
o Calculated by number of new cases during a period divided by the number of subjects
at risk in the population at the beginning of the study
Figure 1: Obtaining Incidence Rates and Relative Risk from a Study

             Diseased    Not diseased    Total
Exposed      a           b               a + b = Ne
Unexposed    c           d               c + d = No
Total        a + c       b + d           a + b + c + d

• Ne = a + b (total number exposed)
• No = c + d (total number unexposed)
• Ie = Cumulative Incidence of disease among exposed
Ie = a / (a + b) = a / Ne
• Io = Cumulative Incidence of disease among unexposed
Io = c / (c + d) = c / No
• Relative risk (RR) is the measure of strength of the association between exposure and disease.
• It compares the risk of disease in the exposed group with that in the unexposed
group:
RR = Ie / Io
• Interpreting Relative Risk:
o RR = 1
A relative risk of one (1) shows no increased risk in the exposed group (i.e., no
association between exposure and disease).
o RR > 1
A relative risk greater than one (1) indicates increased risk in the exposed group.
o RR < 1
A relative risk less than one (1) indicates a decreased risk in the exposed group.
Instructions
Tutor will provide example of calculating cumulative incidence of disease among exposed
and unexposed, and of calculating relative risk. Follow along with the calculations.
Example
Figure 2: Relationship Between Diabetes and Obesity

             Diabetic    Not diabetic    Total
Obese        773         27              800
Non-obese    227         33              260
Total        1000        60              1060

Cumulative Incidence of disease among exposed:
Ie = 773/800 = 0.97
Cumulative Incidence of disease among unexposed:
Io = 227/260 = 0.87
Relative Risk:
RR = Ie/Io = 0.97/0.87 = 1.1
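The same arithmetic can be checked with a short Python sketch. The function names are illustrative only; the counts are those of the diabetes and obesity example above.

```python
def cumulative_incidence(new_cases, at_risk):
    """New cases during follow-up divided by subjects at risk at the start."""
    return new_cases / at_risk


def relative_risk(a, b, c, d):
    """Relative risk from a cohort 2 x 2 table.

    a = exposed diseased,    b = exposed not diseased,
    c = unexposed diseased,  d = unexposed not diseased.
    """
    ie = cumulative_incidence(a, a + b)  # incidence among exposed
    io = cumulative_incidence(c, c + d)  # incidence among unexposed
    return ie / io


# Obese: 773 diabetic, 27 not; non-obese: 227 diabetic, 33 not.
ie = cumulative_incidence(773, 800)
io = cumulative_incidence(227, 260)
rr = relative_risk(773, 27, 227, 33)
print(round(ie, 2), round(io, 2), round(rr, 1))  # 0.97 0.87 1.1
```

Working from the unrounded incidences gives a relative risk of about 1.11, which agrees with the hand calculation to one decimal place.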
Risk Difference (RD)/Attributable Risk (AR)

• There is another measure of association which is commonly used, called Risk Difference
(RD) or absolute difference
• Risk difference estimates the amount of the disease which is caused by certain exposure
(sometimes called ‘Attributable Risk’).
o It specifies the amount of disease which can be reduced from the study population if
certain exposure is removed.
• RD is the difference between the incidence rate in the exposed and that in the unexposed group.
• Ie = Incidence of disease in exposed
Io = Incidence of disease in unexposed
• Risk Difference:
RD = Ie - Io
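A minimal Python sketch of the risk difference follows. The function name and the counts are hypothetical, chosen only to show the subtraction; they do not come from the worksheet.

```python
def risk_difference(a, b, c, d):
    """Risk difference (attributable risk) from a cohort 2 x 2 table.

    a = exposed diseased,    b = exposed not diseased,
    c = unexposed diseased,  d = unexposed not diseased.
    """
    ie = a / (a + b)  # incidence among exposed
    io = c / (c + d)  # incidence among unexposed
    return ie - io


# Hypothetical cohort: 40 of 200 exposed and 10 of 200 unexposed develop disease.
rd = risk_difference(40, 160, 10, 190)
print(round(rd, 2))            # 0.15
print(round(rd * 1000), "excess cases per 1000 exposed")  # 150 excess cases per 1000 exposed
```

Expressing the result per 1000 people is one common way to present a risk difference, since it reads directly as the number of cases that might be avoided if the exposure were removed.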
Instructions
You will work in small groups to answer Question 1, 2, or 3 in the worksheet.
Refer to:
• Worksheet 14.1: Calculate Relative Risk (RR)
Advantages and Limitations of Cohort Studies
Advantages of Cohort Studies

• Can explain temporal relationship between exposure and disease. This is possible as the
cohort is classified in relation to exposure to the factor before the disease develops.
• Development of disease after exposure to a certain factor may indicate causal
relationship.
• Permits direct measurements of exposure specific incidence of the disease. Therefore, the
absolute difference in disease incidence rates between groups (Attributable Risk) and also
the true relative risk can be measured.
• Allows for evaluation of multiple outcomes from/related to the same exposure.
o For example, although prospective studies of smokers and non-smokers were
originally designed to detect association of smoking with lung cancer, they also
showed that smoking is associated with the development of several additional
ailments such as emphysema, coronary heart disease, peptic ulcer, cancer of
oesophagus, and cancer of urinary bladder.
• Particularly useful when exposure is rare.
• Less prone to selection bias for prospective cohort studies.
Limitations of Cohort Studies

• Expensive and time consuming; they take a long time to complete.
o This is true for most prospective cohort studies.
• Liable to attrition or loss to follow-up among subjects in the study.
o This can affect the validity of the results obtained.
• Inefficient for rare diseases.
• Not suitable for diseases with long latency period, as these need long follow-up periods.
Comparison between Case Control and Cohort Studies
Figure 3: Comparison Between Case-Control and Cohort Studies

Characteristic        Case-Control Studies                                  Cohort Studies
Sample size           Small                                                 Large
Costs                 Less                                                  More
Study time            Short                                                 Long
Rare disease          Advantage                                             Disadvantage
Rare exposure         Disadvantage                                          Advantage
Multiple exposures    Good for studying multiple exposures (Advantage)      Not good for studying multiple exposures (Disadvantage)
Multiple outcomes     Not good for studying multiple outcomes (Disadvantage)    Good for studying multiple outcomes (Advantage)
Take-Home Assignment
Activity: Take-Home Assignment: Practice Calculating Odds Ratio and Relative Risk
Instructions
Refer to:
• Worksheet 14.2: Homework Assignment: Practice Calculating Odds Ratio and
Relative Risk
You will work in small groups to do the following:

• Complete Question 1 and Question 2 in the worksheet.
• Refer to class materials/notes, and also to Handout 12.4 for reference as needed.
• Be sure that all group members participate in the problem-solving and discussion.
Key Points
• Cohort studies are good at studying an exposure that has multiple outcomes.
• It is easy to explain temporal relationship in the cohort studies
• The measure of risk for cohort studies is relative risk (RR)
• Risk Difference is a good measure when you want to find how the disease will be reduced
if the exposure is removed in a population.
• There are some limitations to conducting cohort studies.
Evaluation
• What is ‘risk difference’ and how is it calculated?
• What are the six advantages of cohort studies?
• What are the limitations of cohort studies?
Worksheet 14.1: Calculate Relative Risk (RR)
Instructions
• Work in small groups to complete the problems in this worksheet during class.
• The tutor will assign you Question 1, 2, or 3 to begin with.
• After you have finished the question assigned to your group by the tutor, continue
working on the next question in the worksheet.
• You will have approximately 15 minutes to complete your work.
Question 1
In a study of exposure of dye from a certain industry and occurrence of cancer of urinary
bladder the following results were obtained.
Exposure to Dye and Occurrence of Cancer of Urinary Bladder

Exposure status       Diseased    Non-diseased    Total
Exposed to dye        20          5               25
Not exposed to dye    5           20              25
Total                 25          25              50
Work together to:
a) Calculate Relative Risk (RR)
b) Calculate the Risk Difference (RD)
c) Interpret the results
Question 2
A study was conducted among workers and surrounding communities in a mountainous
region of Tanzania to investigate exposure to iodized salt and occurrence of thyroid goitre.
The results are presented below:
Occurrence of Thyroid Goitre and Exposure to Iodized Salt

Exposure status                Diseased    Non-diseased    Total
Exposed to iodized salt        100         120             220
Not exposed to iodized salt    100         100             200
Total                          200         220             420
Work together to:

Question 3
A study was conducted among workers in a sheet metal factory to investigate exposure to
asbestos dust from the sheet metal industry and occurrence of lung cancer. The results are
presented below:
Occurrence of Lung Cancer and Exposure to Asbestos Dust

Exposure status            Diseased    Non-diseased    Total
Exposed to asbestos        100         100             200
Not exposed to asbestos    100         120             220
Total                      200         220             420
Work together to:
Worksheet 14.2: Calculating Odds Ratio And Relative Risk
Instructions
• Work in small groups to complete the worksheet as homework.
• Refer to class materials/notes, and also to Handout 12.4: Measures of Effect in Cohort
and Case-Control Studies (from Session 12) for reference as needed.
• Be sure that all group members participate in the problem-solving and discussion.
• Write down all of your work, and note any questions/challenges you have along the way.
• Submit your assignment to the instructor at the next class period. Be sure all group
members’ names are recorded on the work you submit.
• If there is class time available, the tutor may discuss this assignment in plenary, and your
group may be asked to share the answers.
Question 1
A case-control study was done to determine the association between the use of aspirin and a
suspected adverse effect. A total of 200 cases and 200 controls were recruited. Among the
cases, 190 had used aspirin before, compared with 130 among the controls.
Work together to:

a) Present the results in a two by two table
b) Calculate Odds Ratio
Question 2
In a prospective cohort study to determine the risk of exposure to arsenic and squamous cell
carcinoma of skin, 600 non-diseased people were involved in the study. Among them 300
were exposed to arsenic metal and the other 300 were not. After a period of 10 years of
follow up, 150 people among those who were exposed had developed squamous cell
carcinoma while only 20 people developed squamous cell carcinoma among those who were
not exposed to the metal.
Exposure to Arsenic and Squamous Cell Carcinoma in a Cohort Study

Exposure status           Diseased    Non-diseased    Total
Exposed to arsenic        150         150             300
Not exposed to arsenic    20          280             300
Total                     170         430             600
Using the above information, work together to:

a) Calculate Relative risk (RR) of developing squamous cell carcinoma of the skin
b) Interpret the RR
c) Calculate the Attributable Risk or Risk Difference for the disease
d) Interpret the AR/RD
Session 15: Testing and Screening of a Disease
Learning Objectives
• Define the concepts of testing and screening
• Identify types of screening and their specific aims
• Describe the measurement properties of a screening test
• Explain criteria for initiating a screening programme
• Identify types of accepted screening methods for specific diseases and target populations
Introduction to Testing and Screening
Definitions of Screening
• Screening:
o The examination of asymptomatic people in order to classify them as likely or
unlikely to have the disease of interest.
o The presumptive identification of unrecognized disease or defect by the application of
tests, examinations, or other procedures which can be applied rapidly to sort out
those who probably have a disease from those who probably do not.
What Does Screening Mean?

• The application of a medical procedure or test to people who as yet have no symptoms of
a particular disease, for the purpose of determining their likelihood of having the disease.
• Early detection of disease (earlier than would usually occur in clinical practice) that is
carried out in the hope of improving diagnosis.
• The screening procedure itself does not diagnose the illness.
• Those who have a positive result from the screening test will need further evaluation with
subsequent diagnostic tests or procedures for purpose of confirmation.
• Success of screening program depends on:
o Disease experience
o Characteristics of the screening procedure
o Effectiveness of early treatment
Examples of Screening Tests

• Pap smear
• Mammogram
• Clinical breast exam
• Blood pressure determination
• Cholesterol level
• Eye examination/vision test
• Urinalysis
Types of Screening, Specific Aims, and Measurement Properties
Types of Screening Methods and their Specific Aims

• Mass screening: Screening a whole population.
• Multiple or multiphase screening: Involves the use of a variety of tests on the same
occasion for the same condition.
• Targeted screening: Screening of groups with specific exposures.
• Case-finding or opportunistic screening: Screening of patients visiting a healthcare
delivery point for some other purpose.
Measurement Properties of a Screening Test

• Ideally, a screening test will correctly identify all those with the disease under
investigation and exclude this disease among all non-diseased.
• Qualities of a good screening test:
o Accuracy
The test should give true measurement of the attribute under investigation.
Accuracy is a measurement process yielding values that are equal on average to
the true underlying value for the diagnostic variable being measured.
o Precision/Reproducibility
Consistent results in repeated trials
Precision is the degree to which a series of measurements fluctuates around a
central measurement.
The central value may or may not be the true values of the variable. Precision is
independent of accuracy.
Measurements may be inaccurate and imprecise, inaccurate but precise, or
accurate but imprecise. The ideal measurement is both accurate
and precise.
o Validity
High sensitivity and specificity for the disease in question
Describing the Performance of a New Diagnostic Test

• Often one is faced with the task of evaluating the merit of a diagnostic test.
• Assessing a new diagnostic test begins with the identification of a group of patients
known to have the disorder of interest, using an accepted reference test known as the
‘gold standard’.
• The ‘gold standard’ has the following limitations:
o The gold standard is often the most risky, technically difficult, expensive or
impractical of the available diagnostic options.
For example, post-mortem brain biopsy, the gold standard for the diagnosis of
Alzheimer’s disease
o For some conditions, no ‘gold standard’ is available.
For example, angina pectoris
o Comparisons with an imperfect gold standard may lead to the erroneous conclusion
that the new test is worse, when in fact it is better. For example:
If the new test detects diseased individuals more accurately than the gold
standard, these patients will be mistakenly labeled false positives.
If the new test is negative in more disease-free individuals these patients will be
mistakenly labeled false negatives.
• The results of a screening test and disease status can be examined conveniently by use of
the four-fold contingency table.
Figure 1: Four-fold Contingency Table on How to Analyse a Screening Test

                                              DISEASE STATUS
Screening Test Results    Positive                       Negative                       Total
Test Positive (T+)        True Positive (TP)             False Positive (FP)            Total Test Positive
Test Negative (T-)        False Negative (FN)            True Negative (TN)             Total Test Negative
Total                     Total Disease Positive (D+)    Total Disease Negative (D-)    Total Population (N)

• From a population of N people, D+ have the disease, while D- do not have the disease.
• The prevalence of disease in this population can be represented as:
Prevalence = D+ / N
• In this population:
o T+: Persons who are positive on the screening test
o T-: Persons who are negative on the screening test
o True Positives (TP): Diseased individuals with a positive screening test
o False Positives (FP): Healthy individuals with a positive screening test
o False Negatives (FN): Diseased individuals with a negative screening test
o True Negatives (TN): Healthy individuals with a negative screening test
• An ideal screening test has as few false positives and false negatives as possible.
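To make the four cells concrete, the sketch below tallies them from a list of individual results. It is only an illustration: the record layout (a pair of booleans, test result first and true disease status second) and the sample data are assumptions made for this example.

```python
from collections import Counter


def tally(records):
    """Count TP, FP, FN and TN from (test_positive, diseased) boolean pairs."""
    labels = {
        (True, True): "TP",    # diseased, test positive
        (True, False): "FP",   # healthy, test positive
        (False, True): "FN",   # diseased, test negative
        (False, False): "TN",  # healthy, test negative
    }
    counts = Counter(labels[(bool(t), bool(d))] for t, d in records)
    return {cell: counts.get(cell, 0) for cell in ("TP", "FP", "FN", "TN")}


# Hypothetical results for five people.
records = [(True, True), (True, False), (False, True), (False, False), (True, True)]
print(tally(records))  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```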
Sensitivity and Specificity of a Test

• The capacity of a screening test to identify correctly the diseased against the non-diseased
is expressed by its sensitivity and specificity, which are a measure of the validity of a test.
• Sensitivity: The ability of a test to give a positive finding when the tested person truly
has the disease under study.
o It is the probability that the test will be reactive in a diseased individual.
• A test with a high sensitivity will detect a high percentage of diseased individuals.
Sensitivity = [True Positives (TP) / Total Disease Positive (D+)] × 100%
• Specificity: The ability of a test to give a negative finding when the tested person is truly
free of the disease under study.
o It is the probability that a test result will be non-reactive in an individual who is not
diseased.
Specificity = [True Negatives (TN) / Total Disease Negative (D-)] × 100%
• In practice, sensitivity of a test is usually determined in a group of proven cases of a given
disease, while its specificity is determined in a group of healthy people, or people with
diseases other than the one under investigation.
• Sensitivity and specificity are test characteristics and are not very much influenced by
characteristics of the population (such as age, sex, etc.)
• However, the pattern and prevalence of other diseases will influence specificity of the
test. The mix of sub-clinical and clinical cases in a population might affect the sensitivity
of a test.
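The two definitions can be written as a short Python sketch. The function names and the counts are hypothetical; D+ is taken as TP + FN and D- as TN + FP, following the four-fold table.

```python
def sensitivity(tp, fn):
    """Percentage of diseased individuals (D+ = TP + FN) who test positive."""
    return 100 * tp / (tp + fn)


def specificity(tn, fp):
    """Percentage of non-diseased individuals (D- = TN + FP) who test negative."""
    return 100 * tn / (tn + fp)


# Hypothetical evaluation: 100 proven cases (80 test positive, 20 negative)
# and 100 healthy people (90 test negative, 10 positive).
print(sensitivity(tp=80, fn=20))  # 80.0
print(specificity(tn=90, fp=10))  # 90.0
```

Note that each measure uses only one column of the table, which is why neither depends directly on how common the disease is in the group tested.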
Predictive Value of a Test

• When interpreting a screening test result, a sign or symptom under practical
circumstances, it is important to measure its diagnostic accuracy in the population in
which it will be used.
• This is the predictive value, which is used to measure the accuracy of a test in classifying
individuals as diseased or not diseased when the test is positive or negative, respectively.
Predictive Value of a Positive Test (PVP)

• PVP: The proportion of diseased amongst all positive tests.
o It signifies the probability that a given patient with a positive test indeed does have the
disease.
PVP = (True Positives / Total test positives) × 100%
• The predictive value is influenced by the sensitivity and specificity of the test, as well as
the prevalence of the disease in the population.
Predictive Value of a Negative Test (PVN)

• PVN: The proportion of people with negative test results who are correctly diagnosed.
o It signifies the probability that a given patient with a negative test indeed does not
have the disease.
PVN = (True Negatives / Total test negatives) × 100%
• The following table illustrates the influence of prevalence on predictive value, and the
independence of sensitivity and specificity from prevalence:
Figure 2: Influence of Prevalence on the Predictive Value and Accuracy of a Screening Test

LOW PREVALENCE (10%)
                 Disease    Disease
                 positive   negative   Total
Test positive        9          9        18
Test negative        1         81        82
Total               10         90       100

HIGH PREVALENCE (50%)
                 Disease    Disease
                 positive   negative   Total
Test positive       45          5        50
Test negative        5         45        50
Total               50         50       100
• In both cases sensitivity and specificity are 90% but the predictive value of a positive test
is different and depends on the prevalence of the disease in the study sample.
o Low prevalence (10%): PVP = (9/18) × 100 = 50%
o High prevalence (50%): PVP = (45/50) × 100 = 90%
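The effect of prevalence can also be checked numerically. The sketch below is illustrative only: the function name and the default population of 100 are assumptions. It rebuilds the 2 × 2 table from sensitivity, specificity and prevalence, then reads off both predictive values.

```python
def predictive_values(sens, spec, prevalence, population=100):
    """Return (PVP, PVN) in percent; all inputs are proportions between 0 and 1."""
    diseased = prevalence * population
    healthy = population - diseased
    true_pos = sens * diseased
    false_neg = diseased - true_pos
    true_neg = spec * healthy
    false_pos = healthy - true_neg
    pvp = true_pos / (true_pos + false_pos) * 100
    pvn = true_neg / (true_neg + false_neg) * 100
    return pvp, pvn


# Same test (90% sensitivity, 90% specificity) at the two prevalences in Figure 2.
for prevalence in (0.10, 0.50):
    pvp, pvn = predictive_values(0.90, 0.90, prevalence)
    print(f"Prevalence {prevalence:.0%}: PVP = {pvp:.1f}%, PVN = {pvn:.1f}%")
```

With the same test, the PVP rises from 50% to 90% as prevalence rises from 10% to 50%, while the PVN moves in the opposite direction.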
• Which cut-off point one chooses depends on several considerations:
o Is it harmful or serious to miss a case?
If this is true, then choose a value for the positivity criterion that minimizes the
false negatives. Sensitivity should be high, usually at the expense of specificity.
(For example, neonatal phenylketonuria screening).
o Is treatment risky? (Does it risk serious side-effects, operative mortality, etc.?)
If this is true, then the number of false positives should be low, and specificity
high (usually at the expense of sensitivity).
This also applies when a false positive diagnosis may have deleterious effects on a
patient’s lifestyle, self-image, or financial situation (For example: AIDS, mental
illness, or learning disorders).
o Is a highly specific and sensitive confirmatory test available?
If this is true, then aim for a very high sensitivity in the preliminary (screening)
test.
• For a given test, sensitivity and specificity are generally inversely related. By moving the
cut-off point to increase sensitivity one has to accept a loss in specificity, and vice versa.
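The trade-off between sensitivity and specificity can be seen by moving the cut-off point across a set of measurements. The following is a minimal sketch with invented values: the measurements, the cut-off points and the rule "a value at or above the cut-off is test positive" are all assumptions chosen for illustration.

```python
# Hypothetical test measurements (arbitrary units); higher values suggest disease.
diseased = [6.1, 6.8, 7.2, 7.9, 8.4, 9.0, 9.5, 10.2]
healthy = [4.0, 4.6, 5.1, 5.5, 5.9, 6.3, 6.9, 7.4]


def sens_spec(cutoff):
    """Sensitivity and specificity (percent) when values >= cutoff count as positive."""
    true_pos = sum(1 for value in diseased if value >= cutoff)
    true_neg = sum(1 for value in healthy if value < cutoff)
    return true_pos / len(diseased) * 100, true_neg / len(healthy) * 100


cutoffs = [5.0, 6.0, 7.0, 8.0]
results = [sens_spec(cutoff) for cutoff in cutoffs]

for cutoff, (sens, spec) in zip(cutoffs, results):
    print(f"Cut-off {cutoff}: sensitivity = {sens:.1f}%, specificity = {spec:.1f}%")
```

A low cut-off catches every diseased person but labels many healthy people positive; a high cut-off does the reverse.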
Serial/Consecutive Testing
• In serial testing, we first apply one test to a certain population and all people identified as
having a positive test are then submitted to a second test.
• This is called re-testing of reactive individuals with another test.
• For example, consider the Venereal Disease Research Laboratory (VDRL) test as a
screening test for syphilis, and the TPHA (Treponema Pallidum Haemagglutination Assay)
as a confirmatory test.
• We will assume a population with a prevalence of syphilis of 20%.
Figure 3: First Serial Test for Screening of Syphilis Disease (by VDRL)
                Syphilis (by clinical diagnosis)
VDRL            Positive    Negative    Total
Positive           180          80        260
Negative            20         720        740
Total              200         800       1000
• Predictive Value Positive = (180/260) × 100 = 69.2%

• Sensitivity = (180/200) × 100 = 90%
• Specificity = (720/800) × 100 = 90%
• In serial testing, the 260 people who tested positive on the VDRL will then be subjected
to a second series of testing with TPHA, which is known to have the following test
characteristics: 95% Sensitivity and 99% Specificity.
Figure 4: Second Serial Test for Syphilis Disease (by TPHA)
                Syphilis (by VDRL)
TPHA            Positive    Negative    Total
Positive           171         0.8      171.8
Negative             9        79.2       88.2
Total              180        80.0      260
• Positive Predictive Value = (171/171.8) × 100 = 99.53%
• Sensitivity = 95%
• Specificity = 99%
• Overall sensitivity = ((180 – 9)/200) × 100 = (171/200) × 100 = 85.5%
• Overall specificity = ((720 + 79.2)/800) × 100 = (799.2/800) × 100 = 99.9%
• Overall positive predictive value = (171/171.8) × 100 = 99.53%
• When using tests in series, the overall sensitivity does decrease, while overall specificity
increases.
• Most importantly, the predictive value of a positive test improves because the prevalence is
increased for the second test.
o In this example, the prevalence increased to 69.2%. [(180/260) × 100 = 69.2%]
• In general, one can use the following formulae to calculate overall sensitivity and
specificity (with each value expressed as a proportion):
o Overall Sensitivity = (Sensitivity A × Sensitivity B)
o Overall Specificity = 1 – [(1 – Specificity A) × (1 – Specificity B)]
• As far as sensitivity and specificity are concerned, it does not matter in which order the
tests are carried out, but efficiency and total cost may differ considerably.
• In general, one benefits most from serial testing if the most sensitive test is used for
screening, and the most specific test as confirmatory.
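The VDRL and TPHA example can be reproduced step by step. The sketch below is an illustration under stated assumptions: the function name is invented, and the second test is assumed to perform the same way in VDRL-positive people as in the general population, as the worked example does.

```python
def serial_testing(sens_a, spec_a, sens_b, spec_b, prevalence, population=1000):
    """Overall sensitivity, specificity and PVP (percent) when test B is applied
    only to people who were positive on test A. Inputs are proportions."""
    diseased = prevalence * population
    healthy = population - diseased
    # First test (A)
    true_pos_a = sens_a * diseased
    false_pos_a = (1 - spec_a) * healthy
    # Second test (B), applied to test A positives only
    true_pos_b = sens_b * true_pos_a
    false_pos_b = (1 - spec_b) * false_pos_a
    overall_sens = true_pos_b / diseased * 100
    overall_spec = (healthy - false_pos_b) / healthy * 100
    overall_pvp = true_pos_b / (true_pos_b + false_pos_b) * 100
    return overall_sens, overall_spec, overall_pvp


# VDRL (90% / 90%) followed by TPHA (95% / 99%) at 20% prevalence.
sens, spec, pvp = serial_testing(0.90, 0.90, 0.95, 0.99, 0.20)
print(f"Overall sensitivity = {sens:.1f}%")
print(f"Overall specificity = {spec:.1f}%")
print(f"Overall PVP = {pvp:.2f}%")
```

The results agree with the worked example: sensitivity falls to 85.5%, while specificity and PVP rise to 99.9% and 99.53%.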
Summary
• Serial tests result in:
o Lower sensitivity (higher False Negative Rate)
o Increased specificity
o Increased PVP (lower False Positive Rate)
Criteria for Initiating a Screening Programme

Disease
• Should be serious and of public health importance
• The natural history should be well understood
• There should be a long period between first sign and overt disease
• High prevalence of pre-clinical stage
Diagnostic Test
• Sensitive and specific
• Simple and inexpensive
• Safe and acceptable to both the public and the profession
• Reliable
Diagnosis and Treatment

• Facilities should be available and adequate
• Treatment should be effective, safe, acceptable and available
• There should be an agreed-upon policy on whom to treat as patients, including
management of borderline cases.
• Costs of a screening programme must be balanced against the number of cases detected
and the consequences of not screening. The cost should also be economically balanced in
relation to the total expenditure on medical care.
Biases in Evaluating Screening Programmes
Lead-time Bias
• Lead time is defined as the interval between the diagnosis of a disease at screening and the
time when it would have been detected due to the development of symptoms. Lead-time bias
arises when this interval is counted as additional survival.
• Diseases with a long preclinical phase are more readily detected by screening than
rapidly progressing cases with a short preclinical phase.
• A program may be seen as successful when in fact observed differences in mortality were a
result merely of the detection of less rapidly fatal cases through screening, while those
more rapidly fatal are diagnosed after development of symptoms.
• If time to outcome (e.g. death) is measured from point of diagnosis, early diagnosis will
increase the time to outcome (e.g. the length of survival) by the interval between
diagnosis by screening, and when diagnosis would have occurred by ordinary means.
• This can make early diagnosis appear to increase survival time, even when it has had no
effect, or may even have a damaging effect.
Figure 5: Representation of Lead Time Bias in Screening of a Disease
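A small numerical illustration may help. In this hypothetical sketch (all ages are invented), the same patient dies at the same age whether or not the disease is found by screening; only the starting point for measuring survival changes.

```python
# Hypothetical patient whose disease course is NOT changed by early detection.
age_at_screen_diagnosis = 60    # diagnosis if the disease is found by screening
age_at_symptom_diagnosis = 64   # diagnosis if the disease is found through symptoms
age_at_death = 68               # identical in both scenarios

lead_time = age_at_symptom_diagnosis - age_at_screen_diagnosis
survival_if_screened = age_at_death - age_at_screen_diagnosis
survival_if_not_screened = age_at_death - age_at_symptom_diagnosis

print(f"Lead time: {lead_time} years")
print(f"Survival from diagnosis (screened): {survival_if_screened} years")
print(f"Survival from diagnosis (not screened): {survival_if_not_screened} years")
```

Survival measured from diagnosis appears to double, yet the whole difference equals the lead time and the patient has gained no extra years of life.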
Length-time Bias
• Length-time bias is the over-representation among screen-detected cases of those with a
long preclinical phase and hence a favorable prognosis.
• The proportion of slowly progressing disease picked up by screening will be greater than
the proportion picked up by standard clinical practice, since rapidly progressing disease
will tend to become symptomatic more quickly.
• Therefore, patients diagnosed by screening will progress more slowly than those
diagnosed by conventional means, even if early treatment has no impact.
Compliance Bias
• Volunteers for screening are generally more health conscious/concerned than the general
population. They tend to assume greater responsibility for their own care, and are often
more likely to comply with therapy.
• Groups detected by screening may do better than others, not because early treatment
matters, but because they comply with treatment more than others.
o This is not due to early detection by screening, but to the same reason that made them
volunteer for screening in the first place. It is a ‘volunteer’ bias.
Examples of Generally Accepted Screening Programs

• The following are some examples of generally accepted screening methods, the diseases
they screen for, and the target populations for screening:
Test      Disease                      Target group
Weight    Malnutrition                 Under-fives
Hb        Hookworm                     Under-fives
          Sickle cell disease          Patients with malaria
VDRL      Syphilis during pregnancy    Pregnant women
Widal     Typhoid fever                Food handlers
Take Home Assignment
Activity: Take-Home Assignment
Instructions
Refer to Worksheet 15.1: Calculate Sensitivity, Specificity, Positive and Negative
You will do the following:

• Work individually to complete the worksheet.
• Refer to class materials/notes for reference as needed.
• Prepare to submit your work before the next session.
Key Points
• Screening is defined as examination of asymptomatic people in order to classify them as
likely or unlikely to have the disease of interest.
• Types of screening are: mass screening, multiple or multiphase screening, targeted
screening, case-finding or opportunistic screening.
• The measurement properties of a screening test are accuracy, reproducibility, validity.
• Sensitivity is the ability of a test to give a positive finding when the tested person truly
has the disease under study.
• Specificity is the ability of a test to give a negative finding when the tested person is
truly free of the disease under study.
• The predictive value of a positive test (PVP) means the proportion of diseased amongst
all positive tests.
Evaluation
• What are the four types of screening?
• What are the measurement properties of a screening test?
• What are the criteria for initiating a screening programme?
• Mention any five (5) accepted screening methods, the diseases they screen for, and their
target populations.
References
• Greenberg, R.S., et al. (1993). Medical epidemiology. East Norwalk, CT: Appleton &
Lange.
Saunders.
Worksheet 15.1: Calculate Specificity, Positive and Negative
Instructions
• Individually complete the worksheet as homework.
• Refer to class materials/notes, and also to Handout 12.4: Measures of Effect in Cohort
and Case-Control Studies (from Session 12) for reference as needed.
• Write down all of your work, and note any questions/challenges you have along the way.
• Submit your assignment to the instructor at the next class period.
• If there is class time available, the tutor may discuss this assignment in plenary, and you
may be asked to share your answers.
Question 1
In a cervical cancer screening programme, pap smears were collected from 12,350 women
aged 25-45 years. Among these women, 1,250 were known to have pre-cancer
lesions. All pap smears were processed in a reputable cytology laboratory and 1,235 (650 from
women with pre-cancer lesions) were reported to be abnormal.
Work together to:
a) Put the above data in a table
b) Calculate the sensitivity and specificity of the pap smear
c) Calculate the predictive value positive (PVP) and predictive value negative (PVN) of the
pap smear
d) Would you recommend that the pap smear be used for cervical screening in this
population?
Session 16: Control of Epidemics
Learning Objectives
• Define the terms epidemic, endemic and pandemic
• Identify different types of epidemics
• Identify steps taken during an investigation of an epidemic
• Explain principles of outbreak/epidemic investigations
Introduction to Disease Epidemics

• The magnitude of a particular disease present in a specific population may remain stable
for long periods of time, or it may rise and fall due to fluctuations in the
number of susceptible individuals and the nature and extent of their exposure to disease
agents.
Endemic Disease
• Diseases which are continuous and/or habitually transmitted in populations throughout
the year (such as malaria)
• Endemicity denotes the habitual presence of a disease in a community.
Epidemic Disease
• Epidemic: The occurrence of cases of a specific disease in a population clearly in
excess of the expected incidence in a specified period of time.
o The number of cases that constitute an epidemic will vary with the type of disease.
In some epidemic-prone diseases such as cholera and poliomyelitis, one case is
considered an epidemic.
o In order to say that there is an epidemic, it is necessary to know the level of
endemicity of the disease.
In the USA, a single case of a disease such as malaria is considered an epidemic, since
malaria has already been eliminated there.
Diseases like Ebola do not occur habitually in human populations. A single case
will constitute an epidemic in any part of the world (such as the outbreak in Zaire
in May 1995).
o In Tanzania, it is important to know the average acceptable numbers of cases for
endemic diseases (such as malaria) from a prior year’s records before deciding that an
epidemic is occurring.
Pandemic Disease
• Pandemic: An epidemic that spreads to affect many countries globally.
o Modern epidemiology arose out of the study of so-called ‘classical epidemics’, such
as plague, smallpox, cholera, typhus, typhoid fever and dietary deficiencies.
o Some of these epidemics remain an important threat to many tropical countries.
Frequently Encountered Disease Epidemics

• Poliomyelitis
• Measles
• Mumps
• Rubella
• Hepatitis A
• Streptococcal infections
• Meningococcal meningitis
• Food poisoning
• For Tanzania and other tropical countries, the most important causes of epidemics are
infectious diseases. For other countries (Iraq, Pakistan, Guatemala, etc.) it is also
important to consider road accidents, drug addiction, poisoning, etc. as epidemics that can
affect mortality and morbidity.
o Poisoning and neurological disability epidemics have been reported as a result of
ingestion of wheat products treated with methyl- and ethyl-mercuric compounds. The
wheat was intended only for use as seed and was so treated to prevent fungus growth.
o Other disease outbreaks involving the nervous system (Konzo) have been reported
from Mozambique and Tanzania and were later found to be associated with the
consumption of certain types of cassava with high content of cyanide.
Epidemics of Emerging/Re-Emerging Diseases

• ‘New’ diseases such as Lassa fever and Legionnaire’s or Veteran’s disease continue to
pose problems from time to time in certain parts of the world. HIV/AIDS is also now
recognized as a world-wide problem.
o Lassa fever: A viral disease transmitted from rodents. It was first recognized in 1969,
when three nurses contracted it in Nigeria and two of them died.
o Ebola, which is a viral disease, was first recognized in Southern Sudan and Zaire in
1976. Subsequently it was found in Southern Sudan in 1979. An Ebola virus
Haemorrhagic fever outbreak has also been reported in Zaire in May 1995.
o Legionnaire’s disease: An outbreak of pneumonia at a convention of the American
Legionnaires in Philadelphia in 1976, in which there were 29 deaths. A gram-negative
bacillus, Legionella pneumophila, was identified as the causative agent.
o AIDS: An immune deficiency disorder (Acquired Immune Deficiency Syndrome)
brought about by infection with the Human Immunodeficiency Virus (HIV). It was
described for the first time in 1981 among men who have sex with men (MSM) and
intravenous drug users in the USA. Later the epidemic was found to spread among
heterosexual populations in Africa and elsewhere. The disease has a long incubation
period (5-10 years) and is transmitted through sexual contact, blood transfusion and
unsterile skin piercing instruments (e.g., needles) including injections. Vertical
transmission from mother-to-child is also an important route.
Types of Epidemics
Common Source
• Occurs when a group of people are exposed to the same causative agent.
• If the period of exposure to the agent is brief and essentially simultaneous for all persons
contracting the disease, the epidemic is called a ‘point source epidemic’.
• All persons are affected by the same source and person-to-person transmission does not
occur.
• Common source epidemics are not necessarily caused by infectious agents; they may also
result from common exposure to noxious agents in the environment. Examples include:
o The Bhopal disaster: A large industrial catastrophe that occurred in 1984 at the Union
Carbide India Limited pesticide plant. Over 500,000 people were exposed to leaked
harmful gas and toxins. Chemicals continue to pollute groundwater in the area.
o The Chernobyl disaster: A nuclear accident that occurred in 1986 in Ukraine. A
series of explosions occurred in one of the nuclear reactors, and radioactive materials
polluted the surrounding areas.
o Other examples might include children swimming in a chemically polluted river or
factory workers exposed to extreme heat or volatile chemicals.
Propagated (Progressive) Epidemic

• Propagated epidemics result from transmission of an infectious agent from an infected
host to a susceptible one.
• The transmission can either be direct (e.g. infectious hepatitis or measles) or indirect
through a vector, as in malaria and yellow fever.
• Transmission of the infecting organism continues until the number of susceptible
individuals is depleted, or until susceptible individuals are no longer exposed to infected
persons or intermediary vectors.
• There are three important aspects of person-to-person transmission of disease, and they
include:
o Generation time: The time interval between receipt of infection and maximal
infectivity for both clinical and subclinical infection
o Herd immunity: The decreased probability of a group of people or community to
develop an epidemic upon introduction of an infectious agent. The decreased
probability is due to the presence of a high proportion of immunes although there may
be a certain number of persons who are individually susceptible to the agent.
o Secondary Attack Rate: The proportion of contacts who get a communicable disease
as a consequence of contact with the index case within the accepted incubation period.
Secondary Attack Rate Formula
Secondary Attack Rate = [(No. of new cases in a group) – (initial/index cases in a period)]
/ [(No. of susceptible persons in a group) – (initial cases)] × 10^k
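The secondary attack rate formula can be sketched as a short Python function. The function name, the default k = 2 (which expresses the rate as a percentage) and the example household figures are assumptions made for illustration.

```python
def secondary_attack_rate(total_cases, index_cases, susceptibles, k=2):
    """Secondary attack rate per 10**k susceptible contacts.

    total_cases  : all cases in the group during the period, index cases included
    index_cases  : initial (index) cases
    susceptibles : susceptible persons in the group, index cases included
    """
    secondary_cases = total_cases - index_cases
    contacts_at_risk = susceptibles - index_cases
    if contacts_at_risk <= 0:
        raise ValueError("There must be susceptible contacts besides the index cases")
    return secondary_cases / contacts_at_risk * 10 ** k


# Hypothetical example: 1 index case, 7 cases in total, 25 susceptible people.
rate = secondary_attack_rate(total_cases=7, index_cases=1, susceptibles=25)
print(f"Secondary attack rate = {rate:.1f}%")
```

Here 6 secondary cases arise among 24 susceptible contacts, so the secondary attack rate is 25%.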
Common Source Epidemics vs. Propagated Epidemics

• The curve of onsets for a common source epidemic shows a rapid rise and fall within one
incubation period, whereas new cases in a propagated epidemic continue to develop
beyond one incubation period.
• If you look at the epidemic curve you can see that those affected have different times of
onset of symptoms.
• This is because of individual differences in the level of immunity, and the exposure to
different doses of the infective agent.
• In the curve of a propagated epidemic, usually a gradual rise to a peak may be observed,
followed by a gradual fall in the number of new cases.
o This is because, as the number of cases increases, the number of susceptibles falls
below a critical level so that the number of new cases begins to fall.
• The shape of the epidemic curve in this type of epidemic reflects several factors including
the population size and composition, the proportion of susceptibles in the population, the
number of cases at the start of the epidemic, the contact rate between the infected persons
and the susceptible individuals, the infectivity or pathogenicity of the disease agent and
the incubation period of the disease.
• Sometimes it may be difficult to identify the nature of an epidemic from the shape of the
epidemic curve alone.
• The typical common source epidemic curves may be affected by the continued
development of cases through persistent contamination of the source, or exposure
occurring repeatedly or by a long and variable incubation period.
o The shape of the curve may also vary depending on the size of the population
exposed, the type of source distribution and the extent of use or the extent of contact
with the susceptible population.
o The typical shape of a point source epidemic may be modified by the presence of more
than one disease agent, each with a different incubation period, or if secondary cases
(person-to-person transmission) follow exposure to the original point source.
o Conversely, a propagated epidemic can create a rapidly rising and rapidly falling
epidemic curve similar to that of a common source epidemic. This is especially so
when the disease has a short incubation period and is highly infectious (e.g. cholera).
Principles of Investigations
• In a clinical case, investigation and treatment must go side by side for the successful
management of the patient.
• Likewise an epidemic is always an emergency where action to counteract it must begin
even before complete investigation.
o For example, in 1854, Dr. John Snow showed that the outbreak of cholera around
Broad Street in London resulted from contamination of drinking water with
excrement from cholera sufferers. The epidemic was well managed and put under
control 30 years before the identification and description of the Vibrio cholerae, the
causative organism.
Steps Taken During an Investigation of an Epidemic

• A successful investigation of an epidemic requires careful accumulation of information in
the field and careful analysis of the data as well as the ability to make relevant
observations.
• There is no ‘rule of thumb’ as to the order in which the steps should follow.
• Every epidemic will be assessed according to its particular circumstances.
Prepare for Field Work

• Assemble a team (rapid response team).
• Assemble relevant supplies and equipment (transport media, specimen bottles,
information/education/communication (IEC) materials, treatment guidelines & medical
supplies, transport, communication means, investigation and surveillance forms, funds,
fuel, etc).
• Alert district authorities
Verify the Diagnosis

• Review clinical findings.
• Visit patients yourself (interview and examine for symptoms and signs).
• Laboratory diagnosis.
• Choose a working case definition: who is a case and who is not (by person, place, and
time). Should be highly sensitive.
• Establish index case
o Index case: the earliest documented case of a disease that is included in an
epidemiological study
Establish Existence of an Epidemic

• Compare observed incidence with expected:
o No seasonality: compare with incidence from previous weeks/ months,
o Seasonality: compare incidence from similar periods of earlier years.
• Use action threshold.
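Comparing observed with expected incidence can be sketched as follows. This is only one possible approach: the threshold rule used here (mean of the same week in earlier years plus two standard deviations) is a common convention assumed for illustration and is not prescribed by this manual; the counts are invented.

```python
from statistics import mean, stdev


def action_threshold(baseline_counts, n_sd=2):
    """Assumed rule: mean of earlier comparable periods plus n_sd standard deviations."""
    return mean(baseline_counts) + n_sd * stdev(baseline_counts)


def exceeds_threshold(observed, baseline_counts, n_sd=2):
    """True when the observed count is above the assumed action threshold."""
    return observed > action_threshold(baseline_counts, n_sd)


# Hypothetical case counts for the same week in five earlier years.
baseline = [12, 15, 10, 13, 14]

print(f"Threshold = {action_threshold(baseline):.1f} cases")
print("25 cases this week:", exceeds_threshold(25, baseline))
print("14 cases this week:", exceeds_threshold(14, baseline))
```

For a seasonal disease the baseline should come from similar periods of earlier years; for a non-seasonal disease, counts from the preceding weeks or months can be used instead.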
Identify and Count Cases

• Use the working case definition
• Collect information on cases (and deaths) and create a line list
o A line list is a centralized list of all cases with identifying information: name, address,
contact information
• Demographic: age, sex, tribe, etc.
• Clinical: symptoms and signs, date of onset, lab results, treatment, outcome of treatment
• Exposure and risk factor information
Data Analysis
• To describe the outbreak
o by person/population (tables, bar charts, pie charts)
o place (spot maps)
o time (histograms, graphs)
• Person: who is the population at risk (age, sex, race, occupation, medical status, etc).
• Exposure: occupation, environment, cultural practices, socio-economic factors, etc.
• Determine size of the population at risk.
o Calculate Attack Rate, Case Fatality Rate (assess quality of case management).
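The descriptive analysis can be sketched from a small line list. Everything in the example is hypothetical (the cases, the dates and the population at risk of 200), and the formulas used are the standard ones: attack rate = cases / population at risk × 100, and case fatality rate = deaths / cases × 100.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Case:
    onset: str   # date of onset, written as YYYY-MM-DD
    age: int
    sex: str
    died: bool


# Hypothetical line list for a small outbreak.
line_list = [
    Case("2010-03-01", 4, "F", False),
    Case("2010-03-02", 30, "M", False),
    Case("2010-03-02", 7, "M", True),
    Case("2010-03-03", 45, "F", False),
    Case("2010-03-03", 2, "F", False),
    Case("2010-03-03", 19, "M", False),
    Case("2010-03-04", 60, "M", True),
    Case("2010-03-05", 12, "F", False),
]
population_at_risk = 200  # hypothetical

deaths = sum(case.died for case in line_list)
attack_rate = len(line_list) / population_at_risk * 100
case_fatality_rate = deaths / len(line_list) * 100

print(f"Attack rate = {attack_rate:.1f}%")
print(f"Case fatality rate = {case_fatality_rate:.1f}%")

# Epidemic curve: number of cases by date of onset, drawn as a text histogram.
epi_curve = Counter(case.onset for case in line_list)
for day in sorted(epi_curve):
    print(day, "#" * epi_curve[day])
```

The same line list could be grouped by age, sex or place to describe the outbreak by person and place.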
Formulate and Test Hypothesis

• Hypothesis should address:
o Source of the agent
o Mode of transmission
o Exposures (risk factors)
o Where resources are available and cause not obvious, compare cases with controls in
respect to exposure.
Calculate Odds Ratio (OR), chi square, and look up p-value.
o If sure of the cause, then may need only to study the cases.
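Where cases are compared with controls, the odds ratio and chi-square can be computed from the 2 × 2 table. The sketch below uses invented counts and makes two assumptions: the chi-square is calculated without a continuity correction, and the p-value is computed directly for 1 degree of freedom rather than looked up in a table.

```python
import math


def odds_ratio(a, b, c, d):
    """a = exposed cases, b = exposed controls, c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical outbreak data: exposure = ate a suspect food item.
a, b = 30, 15   # exposed: cases, controls
c, d = 10, 25   # unexposed: cases, controls

chi2 = chi_square_2x2(a, b, c, d)
print(f"Odds ratio = {odds_ratio(a, b, c, d):.1f}")
print(f"Chi-square = {chi2:.2f}")
print(f"p-value = {p_value_1df(chi2):.4f}")
```

An odds ratio well above 1 with a small p-value supports the hypothesis that the exposure is associated with the disease.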
Assess the Local Response Capacity

• What number and type of staff is available locally?
• Which drugs/medical supplies/ guidelines are available to treat the cases?
• What has been done in terms of epidemic response?
• What steps have been taken to interrupt transmission?
• Has any health education been conducted?
Set up immediate control measures

• Be guided by the epidemiological triangle:
o Agent
o Host
o Reservoir
• Deal with the reservoir (if any)
• Interrupt transmission.
• Reduce susceptibility of the host by vaccination, chemo-prophylaxis, improve nutrition,
etc.
• Treat cases
Address the Resource Gaps

• Done as need may arise
o Laboratory support
o Environmental support
o Public information
• Specific disease control needs in terms of:
o Personnel
o Drugs, vaccines and equipment
o Transport, communication and logistics
Report Writing
• Describe the situation using the answers and comments to the steps outlined above.
• Describe the need for outside assistance based on the gaps in resources.
• Make conclusions on the outbreak you are dealing with.
o Give recommendations on priority activities (short term, long term) based on findings
and conclusions.
Dissemination of Findings
• Convey the report to higher levels of the Ministry of Health (relevant division/program,
senior/top management).
• Disseminate report to the Council Health Management Team (CHMT).
• If epidemic has been confirmed, convey report to World Health Organisation (WHO)
through top management (i.e., MoH).
Intensify Surveillance
• Maintain contact with the district for daily updates (cases, deaths, number admitted,
number discharged, areas affected, etc) until end of the epidemic.
The Role of Different Management Levels in the Investigation and Control of Communicable Disease Outbreaks
Role of the Ministry of Health

• Overall coordination, technical support, development and provision of guidelines, policy
in the following areas:
o Coordination: National Task Force
o Case management (procurement and provision of emergency supplies)
o Surveillance (ensure daily updates from affected districts, report to partners and other
stakeholders at national level, alert neighboring districts)
o Public information
o Environmental sanitation
o Logistics management
o Investigation: National rapid response team
Role of the District Health Team
• District level coordination, dissemination of guidelines, implementation of control
measures:
o Coordination: District Task Force
o Case management (transport emergency supplies to affected area and serving health
units, set up treatment sites).
o Surveillance (retrieve data from health units and affected communities, report to
MoH and use the information for control).
o Public information in the communities affected.
o Environmental sanitation and preventive measures: address risk factors including
mass immunization.
o Logistics management/monitoring.
o Investigation: District rapid response team
Role of the Health Units

• Surveillance: data collection, reporting
• Case management: treatment of cases
• Follow-up of cases: home-visiting
• Health education in the health unit
Key Points
• Epidemics are usually either point source or propagated
• The purpose of investigating an epidemic is to identify its cause and the best means to
control it.
• For proper control of epidemics, all steps of the investigation should be carried out.
• Every level of health care delivery has responsibilities in the investigation of an
epidemic.
Evaluation
• What are the steps in investigation of an epidemic?
• What are the roles of the district health team in control of an epidemic?
• Differentiate a point source epidemic from a propagated epidemic.
• Define the following terms: secondary attack rate and herd immunity.
References
• Bonita, R., Beaglehole, R., Kjellstrom, T. (2006). Basic Epidemiology. (2nd Ed). Geneva,
Switzerland: WHO.
• Chin, J. (2000). Control of communicable diseases manual. (17th Ed.) Washington, DC:
American Public Health Association.
Session 17: Integrated Disease Surveillance and Response
Learning Objectives
• Define the terms surveillance and Integrated Disease Surveillance and Response (IDSR)
• Define the terms standard case definition and action threshold
• Identify the priority diseases that are in the IDSR for Tanzania
• State the reporting frequency of IDSR priority diseases
• Identify the non-outbreak-related surveillance responses
Introduction to Integrated Disease Surveillance and Response

Integrated Disease Surveillance and Response
• Surveillance: Being watchful or vigilant for health problems and their determinants, with the
intention of taking action to improve the health of a population.
• Integrated Disease Surveillance and Response: A strategy proposed and adopted by the
World Health Organization/AFRO Regional Assembly in 1998 to strengthen disease
surveillance in member countries using an integrated approach.
Objective of IDSR
• Broadly, the concept of IDSR is:
o To provide epidemiological evidence for use in making decisions and implementing
public health interventions for the control and prevention of communicable diseases.
• A technical definition of IDSR is:
o Surveillance includes the ongoing systematic collection, analysis and interpretation of
health data in the processes of describing and monitoring of a health event.
• The strategy aims to integrate surveillance functions at all levels and is expected to
enhance early detection, reporting, and timely response to epidemic-prone and other
priority endemic diseases.
o A priority activity of IDSR is data collection for action.
• Overall guiding principles for IDSR:
o Usefulness of data collected
o Simplicity of system
o Flexibility of system
o Integration of common activities
o Orientation to action
Aims of IDSR
• Strengthen the capacity to conduct effective surveillance activities.
• Integrate multiple systems so that forms, people and resources can be used more
efficiently and allow health staff to focus more on disease prevention, control and
reporting.
• Improve the use of information for decision making.
• Improve the flow of surveillance information between and within levels of the health
system.
• Improve laboratory capacity and involvement in confirmation of pathogens and
monitoring of drug sensitivity
• Increase the involvement of clinicians in the surveillance system.
• Emphasize community participation in detection and response to public health problems.
• Improve communication:
o Between all levels of health care and public health system using data that can alter the
availability of resources and strengthen the ability of health staff to provide improved
services.
o With the public, target populations, donors, and organizations that provide similar
services. Coordination and collaboration can take place between groups, agencies and
organizations that share similar target populations for disease control objectives.
• Increase the access to and use of standard surveillance case definitions and laboratory
services for confirming suspected cases.
• Increase district-level decision making for defining, recognizing and responding to issues
and needs in the local area as well as to meeting national priorities and targets.
• Strengthen preparedness by integrating transportation, training and supervisory activities.
• Provide district level support in using epidemiological tools to detect, investigate and
respond to epidemics.
• Increase the alertness of clinicians to respond to a possible public health epidemic, even
if it is only 1 case, and help clinicians see the value of sharing information about these
cases with health staff responsible for surveillance.
• The MoH is committed to strengthening IDSR at multiple levels, including community,
health facility, district, region and national.
Standard Case Definition and Action Threshold
Standard Case Definition

• A set of standard criteria for deciding whether a person has a particular disease or other
health condition by specifying clinical criteria and limitations on time, place and person.
• By using a standard case definition we ensure that every case is diagnosed in the same
way, regardless of when or where it occurred, or who identified it.
• We can then compare the number of cases of the disease that occurred in one time or
place with the number that occurred at another time or another place.
Action Threshold
• Denotes the critical point at which action/intervention must be taken to address a disease
outbreak or epidemic. It can be expressed in terms of numbers or proportions.
Priority Diseases Under IDSR Strategy

• Communicable diseases continue to cause most of the mortality, disability and morbidity
in Tanzania.
• Increased numbers of these diseases may be influenced by reduced immunization
coverage, competing resource allocation priorities, increased transmission due to rural-to-
urban migration, poverty, overcrowding, poor sanitation and hygiene, etc.
• The MoHSW has identified 13 priority diseases to be included in the IDSR Strategy.
Figure 1: Priority Diseases Under National Surveillance Strategy

Category                                Diseases
Epidemic-prone diseases                 Cholera
                                        Acute Flaccid Paralysis/Polio
                                        Plague
                                        Measles
                                        Yellow fever
                                        Cerebral Spinal Meningitis
                                        Rabies/animal bite
Diseases targeted for elimination/      Acute Flaccid Paralysis
eradication                             Neonatal Tetanus
                                        Measles
Diseases of public health importance    Malaria
                                        Typhoid
                                        Pneumonia under age of five
                                        Diarrhoea under age of five
• These diseases have been selected based on severity, importance as a burden to the
community and the preferred frequency of reporting.
• The MOHSW requires immediate reporting of all epidemics and encourages case-based
reporting and line-listing of cases during epidemics. Zero reporting should be conducted
on a weekly and monthly basis as shown in the following table.
o Zero reporting: designated reporting sites at all levels should report at a specified
frequency (e.g. weekly or monthly) even if there are zero cases during that time span.
Figure 2: Disease Reporting Schedule for Diseases under National Surveillance Strategy

Weekly Reporting                    Monthly Reporting
1. Cerebral Spinal Meningitis       1. Bacillary dysentery
2. Cholera                          2. Neonatal Tetanus
3. Yellow fever                     3. Pneumonia < 5 years
4. Plague                           4. Diarrhoea < 5 years
5. Measles                          5. Malaria
6. Rabies/animal bite               6. Typhoid fever
7. Acute Flaccid Paralysis
Note: occurrence of unusual health events or emerging/re-emerging diseases (like
SARS, Avian influenza, Influenza A (H1N1), Ebola, Marburg, etc.) should be reported
to MoH immediately.
• Tuberculosis, leprosy and HIV/AIDS are also diseases of public health importance, but
their surveillance is not included in the IDSR reports because they already have very
strong case reporting systems in place.
Standard Case Definitions and Action Thresholds for Specific Diseases

IDSR at Community Level
• One of the objectives of improved IDSR on case-patient detection as stipulated in the
Tanzania National action plan is to develop a community-based surveillance system in
order to enhance early detection, reporting and response.
• Community involvement will also improve linkages between communities and the formal
health system.
• In each community, an inventory of community based organizations (CBOs), village
health workers, traditional healers and traditional birth attendants should be drawn up.
• These groups need to be sensitized to be able to detect cases and report to the nearest
health facility.
• The flow of information from a community member can be in a verbal or written form.
• This information is passed on to the identified community leader in that particular
location who will in turn send the data to the person in charge (or any staff) at a health
facility.
Standard Case Definitions at Community Level

• Acute flaccid paralysis: Any sudden lameness in a child less than 15 years of age.
• Bacillary dysentery: Any person with diarrhoea and visible blood in stool.
• Cholera: Any person 5 years of age or more with lots of watery diarrhoea or death
occurring from passing out lots of watery diarrhoea.
• Malaria: Any person who has an illness with high fever.
• Measles: Any person with fever and rash.
• Meningitis: Any person with fever and altered consciousness.
• Neonatal tetanus: Any newborn who is normal at birth but after 2 days becomes unable to
suck or feed or any neonatal death.
• Plague: Any person with fever and painful swelling under the arms or in the groin area.
• Pneumonia in children < 5 years: Any child less than 5 years of age with rapid breathing
or difficulty in breathing.
• Rabies: Any person with history of animal bite and any of the following: mental
confusion, fear of drinking water or death.
• Typhoid: Any person with a long-standing fever and abdominal pains.
• Yellow fever: Any person with fever and yellowing of eyes or skin.
Standard Case Definitions and Action Thresholds at Health Facility Level

• Acute Flaccid Paralysis (AFP)
o Standard case definition
Any child less than 15 years of age with AFP including Guillain–Barré Syndrome
or any case of any age in whom the medical practitioner suspects polio.
o Action threshold
Single suspected case is a suspected outbreak of polio.
• Bacillary Dysentery
o Standard case definition
Any person with diarrhoea and visible blood in stool, and abdominal pain and
cramps.
o Action threshold
Two or more cases in a week at a health facility is a suspected outbreak.
• Cholera
o Standard case definition
Any person 5 years of age or more who develops severe dehydration or dies from
acute watery diarrhoea.
o Action threshold
Single case is a suspected outbreak.
• Diarrhoea with some dehydration (for children 2 months–5 years of age)
o Any child with diarrhoea and two or more of the following:
Restless or irritable
Sunken eyes
Drinks eagerly
Skin pinch goes back slowly
o Action threshold
When it is observed, in a defined locality, that the number of cases of diarrhoea for
the period of time clearly exceeds the number of cases of the previous
year/season.
• Diarrhoea with severe dehydration
o Any child with diarrhoea and two or more of the following:
Lethargic or unconscious
Sunken eyes
Not able to drink or drinking poorly
Skin pinch goes back very slowly.
o Action threshold
When it is observed, in a defined locality, that the number of cases of diarrhoea for
the period of time clearly exceeds the number of cases of the previous
year/season.
• Uncomplicated Malaria
o Standard case definition
Any person having fever with or without joint pains, sweats, nausea, chills and
vomiting.
o Action threshold
When it is observed that the number of cases for that period exceeds the number
expected by 50% in a defined locality/health facility.
• Measles
o Standard case definition
Any person with a history of fever, skin rash and any of the following: cough,
runny nose, red eyes.
o Action threshold
One case at a health facility is a suspected outbreak.
• Meningococcal Meningitis
o Standard case definition
Any person with sudden onset of fever (more than 38.5°C rectal or 38.0°C
axillary) AND any one of the following: neck stiffness, altered consciousness,
bleeding under the skin.
o Action threshold
Single suspected case is a suspected outbreak.
• Neonatal Tetanus (NNT)
o Standard case definition
Any newborn with normal ability to suck or cry during the first two days of life
and who between the 2nd and 28th day of age cannot suck normally and becomes stiff
or has convulsions (or both).
o Action threshold
A single suspected case needs action.
• Plague
o Standard Case Definition
Any person with sudden onset of fever, headache and painful swelling of inguinal
and axillary lymph nodes or cough with blood-stained sputum.
o Action Threshold
Single case is a suspected epidemic.
• Pneumonia (in children 2 months up to 5 years of age)
o Standard case definition:
Any child with cough or difficulty in breathing, and fast breathing:
2 months up to 12 months: 50 breaths per minute or more.
12 months up to 5 years: 40 breaths per minute or more.
o Action threshold
When it is observed that the number of cases for that period of time clearly
exceeds the number of cases of the previous year/season in a defined locality.
• Severe Pneumonia (in children 2 months up to 5 years of age)
o Standard case definition:
Any child with cough or difficulty in breathing, and any danger signs or chest
indrawing or stridor in a calm child.
Danger signs: Not able to drink or breastfeed, vomiting everything, convulsions,
lethargy or unconsciousness.
o Action threshold
When it is observed that the number of cases for that period of time clearly
exceeds the number of cases of the previous year/season in a defined locality.
• Rabies
o Standard case definition
Any person with a history of animal bite and any of the following: fever, mental
confusion, fear of drinking water, altered consciousness or death.
o Action threshold
Single suspected case is a suspected outbreak.
• Typhoid fever
o Standard case definition
Any person with a prolonged history of fever (excluding malaria) and a history of
abdominal pain, with or without skin rash, constipation or diarrhoea.
o Action threshold
Two suspected cases in a week at a health facility.
• Yellow Fever
o Standard case definition
Any person with sudden onset of fever, followed by jaundice within two weeks of
first symptoms, and/or a history of travelling from an endemic area.
o Action threshold
Single suspected case is a suspected outbreak.
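The numeric action thresholds listed above can be written down as a simple lookup. The following is only a minimal sketch: the names SINGLE_CASE, TWO_PER_WEEK and outbreak_suspected are hypothetical, and the malaria rule assumes that "exceeds the number expected by 50%" means more than 1.5 times the expected count. Diseases whose threshold is a comparison with the previous year/season (diarrhoea, pneumonia) are left out because the manual gives no number for them.

```python
# Illustrative sketch only: thresholds transcribed from the health facility list.
# All names here are hypothetical and not part of any official IDSR tool.

SINGLE_CASE = {
    "acute flaccid paralysis", "cholera", "measles",
    "meningococcal meningitis", "neonatal tetanus",
    "plague", "rabies", "yellow fever",
}
TWO_PER_WEEK = {"bacillary dysentery", "typhoid fever"}


def outbreak_suspected(disease, cases_this_period, expected_cases=None):
    """Return True if the reported count reaches the action threshold."""
    name = disease.lower()
    if name in SINGLE_CASE:
        return cases_this_period >= 1
    if name in TWO_PER_WEEK:
        # cases_this_period is the weekly count at one health facility
        return cases_this_period >= 2
    if name == "uncomplicated malaria":
        if expected_cases is None:
            raise ValueError("malaria needs an expected case count")
        # Assumption: "exceeds expected by 50%" is read as > 1.5 x expected
        return cases_this_period > 1.5 * expected_cases
    raise ValueError(f"no numeric threshold encoded for {disease!r}")
```

For example, outbreak_suspected("Cholera", 1) is True, while outbreak_suspected("Typhoid fever", 1) is False because two cases in a week are needed.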
Non-Outbreak Related Surveillance Data and Response
Introduction
• This section contains information that may be useful for the District Health Team to
use in interpreting surveillance data and providing guidelines for possible action
based on the interpretation.
• The district team will be looking at routine surveillance data for two main purposes:
o To examine district surveillance data over a time frame of months or years in
order to evaluate the public health interventions targeted at reducing the mortality
and morbidity of certain diseases/conditions that are under surveillance.
o To examine district surveillance data for "hidden" outbreaks, for example of
shigella (diarrhoea with blood), meningitis, malaria, and/or measles,
that were not detected by health facilities.
• The analysis of longer-term surveillance data is extremely important since more than
80% of the deaths covered by the diseases/conditions under surveillance occur as non-
outbreak related deaths.
Historical Trends and Epidemiology

• Review historical trends and epidemiology to establish thresholds for outbreaks, based on
previous epidemiology and seasonality.
• For some diseases/conditions, annual and monthly trends may be available, and may also
be available separately for in-patients and outpatients.
• For bacillary dysentery, malaria, meningitis, and ‘endemic’ cholera, historical trends
allow the district to establish a threshold for declaring and reporting a suspected
outbreak.
• Historical data may give the district team some clues about the previous epidemiology and
seasonality of diseases/conditions, which may be helpful in planning prevention activities.
o When examining historical trends, the district team must remember that many of the past
cases may have been from past outbreaks that should have been prevented.
o The method used to determine the non-outbreak (baseline) incidence and hence the
threshold for investigating a suspected outbreak for the district should take this into
account.
o The thresholds for action should be re-evaluated periodically with the aim of lowering
the baseline number of cases that lead to an investigation of a suspected outbreak and
response if the incidence of disease is decreasing with time.
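The manual does not prescribe a particular method for deriving the baseline. As one possible sketch, the baseline for a calendar month could be taken as the median of that month's case counts in previous years, leaving out years known to have had an outbreak so that past outbreaks do not inflate it. The function name baseline_for_month, the use of the median, and the figures below are all assumptions made for illustration.

```python
from statistics import median


def baseline_for_month(history, outbreak_years=()):
    """Median case count for one calendar month across past non-outbreak years.

    history maps year -> number of cases reported in that month.
    outbreak_years lists years to exclude (an assumed, illustrative rule).
    """
    counts = [n for year, n in history.items() if year not in outbreak_years]
    if not counts:
        raise ValueError("no non-outbreak years available")
    return median(counts)


# Invented example figures for one district, month of March
march_cases = {2005: 40, 2006: 44, 2007: 130, 2008: 38}
baseline = baseline_for_month(march_cases, outbreak_years={2007})
```

Recomputing the baseline each year, as incidence falls, is one way of re-evaluating the threshold periodically.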
Time Analysis
• Since data on cases and deaths are collected for most types of diseases/conditions under
surveillance, trends in cases, deaths, and case fatality ratios can be examined.
• Since most diseases/conditions have separate data collection for in-patients and out-
patients, separate trends by in-and out-patients can be examined.
• In-patient trends are often valuable because in-patients have more severe disease,
diagnosis is often more accurate and therefore the data is more specific than outpatient
data.
o Since we are most interested in preventing communicable diseases and deaths, examining
trends in in-patient cases and deaths separately from outpatient data should be a high
priority.
o Trends by age and other factors can only be examined for diseases with case based
information.
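The case fatality ratio mentioned above is the number of deaths expressed as a percentage of the number of cases. A minimal sketch of computing it month by month follows; the function name and the monthly figures are invented for illustration.

```python
def case_fatality_ratio(deaths, cases):
    """Deaths as a percentage of cases; None when no cases were reported."""
    if cases == 0:
        return None
    return 100.0 * deaths / cases


# (month, in-patient cases, in-patient deaths): invented figures
monthly = [("Jan", 50, 2), ("Feb", 40, 4), ("Mar", 0, 0)]
cfr_by_month = {m: case_fatality_ratio(d, c) for m, c, d in monthly}
```

The same calculation can be repeated on outpatient data to give a separate trend line.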
In general, trend lines are going up, remaining level, or going down. For each of
these different types of trend lines, some possible explanations are given below
for the District Health Team to consider:
Increase in Disease Incidence: Issues to Consider

• When an increase in disease incidence is observed, it is important to consider the
following questions and potential explanations.
• Could it be a reporting artifact?
o Are more health facilities reporting this month compared to last month?
o To determine if a true increasing trend exists, trends can be examined by health
facility.
• Could it be a change in reporting criteria or modified case definition?
o Is the case definition being followed? Have new staff joined the facility who may be
reporting cases differently than before?
o Are the cases confirmed or suspected? For example, are some facilities now reporting
suspect cases (with no laboratory confirmation) when previously they reported only
lab-confirmed cases?
• Could it be a seasonal variation?
o Review disease incidence data (i.e., ‘new’ case totals) from a similar time period in
the previous year(s). Is this increase ‘expected’ based on seasonality of the condition
under surveillance?
• Could the neighboring districts be experiencing similar changes?
o For example, if you suspect an outbreak of diarrhoea with blood (shigella), ask
neighboring districts if they are seeing a similar trend in diarrhoea with blood.
• Are there any common features among the reported cases?
o Any geographic clustering? Are most of the ‘excess’ cases coming from just one
health facility or from all health facilities? In either case, develop a hypothesis and
contact the appropriate health facility staff to determine if the increased incidence
represents an outbreak.
• Has a health facility initiated a new screening and treatment program?
o If so, this may lead to the identification and reporting of more cases than in prior time
periods.
• Has any new outreach or health education activity been implemented?
o This may increase healthcare-seeking behaviour and lead to the identification and
reporting of more cases than in prior time periods.
• Has there been any recent immigration of at-risk persons into the community?
o This may lead to an increase in susceptible individuals and may result in increased disease
incidence and reporting.
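The reporting-artifact question above can be checked by comparing only those health facilities that reported in both months. The sketch below is illustrative: the function name, the dictionary layout and the figures are assumptions, not part of the IDSR guidelines.

```python
def compare_like_with_like(last_month, this_month):
    """Each argument maps facility name -> reported cases for that month."""
    both = sorted(set(last_month) & set(this_month))
    return {
        # Change in the district total, whatever the number of reporting sites
        "raw_change": sum(this_month.values()) - sum(last_month.values()),
        "facilities_in_both": both,
        # Change restricted to facilities that reported in both months
        "change_in_same_facilities": sum(
            this_month[f] - last_month[f] for f in both
        ),
    }


# Invented figures: facility C started reporting only this month
last = {"A": 10, "B": 12}
this = {"A": 11, "B": 12, "C": 9}
result = compare_like_with_like(last, this)
```

Here the district total rises by 10 cases, but only 1 of these comes from facilities that reported in both months, which suggests a reporting artifact rather than a true increase.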
Decrease in Disease Incidence: Issues to Consider

• When a decrease in disease incidence is observed, it is important to consider the
following questions and potential explanations.
• Could it be a reporting artifact?
o Are fewer health facilities reporting this month compared to last month? To determine
if a true decreasing trend exists, examine trends by health facility.
• Could it be a change in reporting criteria or modified case definition?
o Is the case definition being followed? Have new staff joined the facility who may be
reporting cases differently than before?
o Are the cases confirmed or suspected? For example, are some facilities now reporting
only lab-confirmed cases when previously they may have reported both lab-confirmed
and suspected cases (with no lab confirmation)?
• Could it be a seasonal variation?
o Review disease incidence data (i.e. new case totals) from a similar time period in
prior year(s). Is this decrease ‘expected’ based on seasonality of the condition under
surveillance?
• Could the neighboring districts be experiencing similar changes?
• Does this decrease relate to the effectiveness of intervention program activities?
o If so, develop a hypothesis and contact appropriate health facility staff both within the
intervention area and outside to determine if the decreased incidence reflects the
intervention.
• Has a health facility stopped providing selected services (e.g. no longer providing
screening and treatment for selected disease)?
o If so, this may lead to the identification and reporting of fewer cases than in prior time
periods.
• Has any community outreach or health education program ceased to operate in the
community?
o This may lead to a reduction in healthcare-seeking behavior and lead to the
identification and reporting of fewer cases than in prior time periods.
• Has there been any recent out-migration of at-risk persons from the community?
o This may lead to a decrease in susceptible individuals and may result in decreased
disease incidence and reporting.
No Change in Incidence: Issues to Consider

• When disease incidence remains unchanged, it is important to consider the following
questions and potential explanations.
• Most of the diseases under surveillance have interventions (e.g. EPI, IMCI, Malaria
control, etc.) that can significantly decrease the incidence of the diseases.
• For some diseases/conditions (e.g. pneumonia), the district may not be able to decrease
the incidence of the diseases.
• If an effective public health intervention is operating in the district, then a decline in cases
of the disease/condition should occur over a relatively short time.
• Factors to consider when there is no change in incidence:
o Reporting artifact
o Change in reporting criteria or use of case definition
o Seasonal variation
o A health facility has stopped providing selected services (e.g. no longer provides
screening and treatment for selected diseases)
o A community outreach or health education program has ceased to operate in the community
o A health facility has initiated a new screening and treatment program
o A new community outreach or health education activity has been implemented
o Recent out-migration or immigration of at-risk persons from or into the
community
Key Points
• Surveillance is being watchful or vigilant for health problems and determinants with the
intention to take action for improvement of health of a population.
• Integrated Disease Surveillance and Response is a strategy proposed and adopted by the
WHO/AFRO Regional Assembly in 1998 to strengthen disease surveillance in member
countries using an integrated approach.
• The overall objective of Integrated Disease Surveillance and Response (IDSR) is to
provide epidemiological evidence for use in making decisions and implementing public
health interventions for the control and prevention of communicable diseases.
• There are specific issues to consider when there is an observed increase, decrease or no
change in disease incidence.
Evaluation
• Explain the standard case definitions for the following diseases:

o Measles
o Rabies
o Polio
o Yellow fever
o Neonatal tetanus
o Plague
• Name two diseases that are reported weekly and two diseases that are reported monthly.
References
• MOHSW. (2001). National Guidelines for Integrated Disease Surveillance and Response
(IDSR). Dar es Salaam, Tanzania: Ministry of Health and Social Welfare.
• WHO. (2001). Technical Guidelines Integrated Disease Surveillance and Response in the
African Region. Harare, Zimbabwe: World Health Organization/Regional Office for
Africa.
Session 18: Planning for Disease Prevention and Control
Learning Objectives
• Define the concepts of prevention of disease and control of disease
• Explain the healthcare planning cycle
• Describe reassessment of burden of disease in healthcare planning
Introduction to Healthcare Planning

• Prevention of Disease: Any activity which reduces the possibility of occurrence, or the
burden of morbidity, disability or mortality of a disease.
o The levels of prevention are categorised as primary, secondary and tertiary prevention
of disease.
• Control of Disease: Reduction of disease prevalence to a level where it is no longer a
public health problem.
• Healthcare Planning: The process of identifying key objectives and choosing among
alternative means for achieving them.
• Evaluation of Healthcare Services: The process of determining the relevance,
effectiveness, efficiency and impact of activities in a systematic way in line with the
agreed-upon objectives.
The Healthcare Planning Cycle

• The healthcare planning cycle is a cyclical and repetitive process that constitutes different
levels of interventions for the purpose of making rational decisions.
• The process consists of the following:
o Measurement or assessment of the burden of illness
o Identification of the cause of illness
o Measurement of the effectiveness of different community interventions
o Assessment of their efficiency in terms of resources used
o Implementation of interventions
o Monitoring of activities
o Reassessment of the burden of illness to determine whether it has been altered
• Epidemiology is involved in all stages of planning.
• The cyclical nature of the process indicates the importance of monitoring and evaluation
to determine whether the interventions have had the desired effects.
• The process is repetitive because each cycle of intervention usually has only a small
impact on the burden of illness, and repeated intervention is required.
Figure 1: Healthcare Intervention
[Cycle diagram centred on "Health care intervention", with six stages in order:
1. Burden of illness, 2. Causation, 3. Community effectiveness, 4. Efficiency,
5. Implementation, 6. Monitoring]
Source: Tugwell et al, 1985
Components of the Healthcare Planning Cycle

• Burden of illness
o Measurement of overall health status of the community is the first step in the planning
process.
o The measurements can include prevalence rates, incidence rates, different measures of
mortality, and the number of cases of different diseases.
o The process of measuring the burden of illness must include different diseases.
• Causation
o After measuring the burden of disease in the community, it is important to identify the
major preventable causes of disease so that intervention strategies can be developed.
o Wherever possible, interventions should have the prevention of disease as their
primary focus, although this is not always possible.
• Measuring effectiveness of different interventions
o It is important to measure the effectiveness of an intervention through indicators or
measurement of health status.
o Common measurements of health used in the planning process include morbidity and
mortality measures. They are used to allocate resources appropriately and equitably.
• Measures of morbidity and mortality
o Prevalence rate, incidence rate, incidence density (as described in Session 8: Source
and Uses of Morbidity and Mortality Statistics)
o Crude Death Rate:
Total deaths in a defined population in a given time period divided by the total
population.
$$\text{Crude Death Rate} = \frac{\text{Total deaths in defined population in a given time period}}{\text{Total population}}$$
o Maternal mortality rate:

All maternal deaths occurring during pregnancy or within 42 days after termination
of pregnancy in a year, divided by the total number of live births in that year (per
100,000).
$$\text{Maternal Mortality Rate} = \frac{\text{No. of pregnancy-related deaths in time period}}{\text{No. of live births in time period}} \times 100{,}000$$
o Fetal and infant mortality measures.
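The two mortality measures defined above are simple ratios and can be sketched as below. The function names and the example figures are invented for illustration. Expressing the crude death rate per 1,000 population is a common convention assumed here; the definition above does not state a multiplier.

```python
def crude_death_rate(total_deaths, total_population, per=1000):
    """Deaths in the period divided by total population, scaled by `per`.

    per=1000 is an assumed convention, not stated in the definition above.
    """
    return per * total_deaths / total_population


def maternal_mortality_rate(maternal_deaths, live_births, per=100_000):
    """Maternal deaths in a year per 100,000 live births in that year."""
    return per * maternal_deaths / live_births


# Invented figures for a hypothetical district
cdr = crude_death_rate(450, 50_000)        # deaths per 1,000 population
mmr = maternal_mortality_rate(12, 3_000)   # deaths per 100,000 live births
```

With these figures the crude death rate is 9 per 1,000 population and the maternal mortality rate is 400 per 100,000 live births.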
Other Key Health Measures

• Life expectancy: the number of years an individual of a given age is expected to live if
current mortality rates continue.
• Median survival time: the time by which fifty percent of individuals with a certain
diagnosis, or having had a certain intervention, would have died (or would have died
if there had been no intervention).
• Person Years of Life Lost (PYLL)

o PYLL is the number of years an individual would have lived had this individual
not died at that age.
o It is a measure of premature mortality.
o Areas experiencing high PYLL are less healthy than those with low PYLL.
o The upper limit could be life expectancy at that age in that area or in another area
(such as a developed country, like the USA)
For example: If an individual dies at the age of 30 years, the PYLL will be the
difference between the age at death and the life expectancy at that age in that area.
The life expectancy in Tanzania is estimated at 58 years. Therefore PYLL = 58 –
30 = 28 for this individual who died at 30 years of age.
• Potential productive years lost

o Refers to the number of years an individual would have remained productive had the
individual not died at that age.
For example, if an individual dies at the age of 34, the number of potential
productive years of life lost would be 21 if we take 55 years as the upper limit of
productivity (age of retirement).
Similarly it will be 26 years if the individual retires at 60 years of age.
• Disability
o A measure of presence of consequences of disease
o Several levels are known and defined as follows:
Impairment: any loss or abnormality of psychological, physiological or anatomical
structure or function.
Disability: any restriction or lack of ability to perform an activity in the manner or
within the range considered normal for a human being.
Handicap: a disadvantage of a given individual resulting from an impairment or a
disability, that limits or prevents the fulfillment of a role that is normal for that
individual.
• Disability Adjusted Life Years (DALYs)

o DALYs combine the years of life lost due to premature mortality and duration of
disability due to morbidity.
o This produces one measure which can be used for comparison or evaluation of an
intervention in a country.
o The higher the DALY the worse the health status of a country.
o In health interventions, cost-effectiveness of two or more interventions can be
compared using DALYs. An intervention that averts more DALYs per unit of cost is
said to be comparatively more cost-effective and hence preferred to others.
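The worked examples for PYLL and potential productive years lost given above are the same subtraction with a different upper limit. A minimal sketch, in which the function name years_lost is hypothetical:

```python
def years_lost(age_at_death, upper_limit):
    """Years between age at death and an upper limit; never negative."""
    return max(upper_limit - age_at_death, 0)


# PYLL example from the text: life expectancy 58, death at age 30
pyll = years_lost(30, 58)

# Potential productive years lost: death at 34, retirement at 55 or at 60
productive_55 = years_lost(34, 55)
productive_60 = years_lost(34, 60)
```

These give 28, 21 and 26 years respectively, matching the figures in the examples. Summing years_lost over all deaths in an area gives a total that can be compared between areas.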
Efficiency, Implementation and Monitoring Stages in the Planning Cycle
• Efficiency
o This is a measure of the relationship between the results achieved and the effort
expended in terms of money, resources and time.
o It provides the basis for the optimal use of resources and involves the complex
interrelationship of costs and effectiveness of an intervention.
o This is the area where epidemiology and health economics are applied together.
o There are two main approaches to the assessment of efficiency:
Cost-Effectiveness Analysis
Cost-Benefit Analysis
o These two measures are important in prioritizing which intervention is best especially
for developing countries.
• Cost-Effectiveness Analysis
o Compares financial expenditure with effectiveness, expressed as a ratio, for example:
dollars per case prevented, dollars per life-year gained, dollars per
quality-adjusted life year gained, etc.
• Cost-Benefit Analysis
o In this measure, both the denominator and numerator are expressed in monetary
terms.
o The health benefits (e.g. lives saved) are measured and given a monetary value.
o If the cost-benefit analysis shows that economic benefits of the program are greater
than the costs, the program should be seriously considered.
o The measurement of efficiency requires many assumptions, and it should be used very
cautiously; it is not value-free and can serve only as a general guideline.
• Implementation
o The fifth stage in the planning process begins by determining a specific intervention and
takes into account the problems likely to be faced in and by the community.
For example, if a planned intervention involves screening women for breast
cancer using mammography, it is important to ensure that the necessary
equipment and personnel are available.
o This stage involves setting specific quantified targets.
For example, ‘To reduce the frequency of smoking in young women from 30% to
20% over a five year period.’
This type of target-setting is essential for assessing the success of an intervention.
• Monitoring
o Monitoring is the continuous follow-up of activities to ensure that they are proceeding
according to plan.
o Monitoring must be directed to the requirements of the specific program, the success of
which may be measured in a variety of ways using short-, intermediate- and long-term
criteria.
o For example, in a community-level hypertension program, monitoring could include
the regular assessment of:
Personnel training
The availability and accuracy of sphygmomanometers (structural)
The appropriateness of case-finding and management procedures (process
evaluation)
The effect on blood-pressure levels in treated patients (outcome evaluation)
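The two efficiency measures described earlier in this section, cost-effectiveness and cost-benefit, can be illustrated with simple arithmetic. The sketch below uses invented figures and hypothetical function names; it ignores the many assumptions (such as how a monetary value is placed on health benefits) that a real analysis would have to state.

```python
def cost_per_unit_effect(total_cost, units_of_effect):
    """Cost-effectiveness ratio, e.g. dollars per DALY averted."""
    return total_cost / units_of_effect


def net_benefit(monetary_benefit, total_cost):
    """Cost-benefit: benefits and costs both expressed in money."""
    return monetary_benefit - total_cost


# Invented figures for two hypothetical interventions
ratio_a = cost_per_unit_effect(20_000, 400)   # dollars per DALY averted
ratio_b = cost_per_unit_effect(30_000, 500)
preferred = "A" if ratio_a < ratio_b else "B"

# Invented cost-benefit figures: benefits valued at 25,000, costs 20,000
worth_considering = net_benefit(25_000, 20_000) > 0
```

Intervention A costs 50 dollars per DALY averted against 60 for B, so A would be the more cost-effective of the two under these invented figures.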
Reassessment of the Burden of Illness

• Reassessment is the final step in the healthcare planning process, and the first step in the next
cycle of activity.
• Reassessment requires a repeat measurement of the burden of illness in the population.
o For example, repeated surveys of population blood pressure levels.
• The following table shows concrete examples of activities undertaken at each stage of the
Healthcare Planning Cycle.
Figure 2: Healthcare Planning Cycle – The Case of Hypertension

Stage in Planning Cycle      Activity
Burden                       Population surveys of blood pressure and control of hypertension
Aetiology (causation)        Ecological studies (salt and blood pressure)
                             Observational studies (weight and blood pressure)
                             Experimental studies (weight reduction)
Community effectiveness      Randomized controlled trials
                             Evaluation of screening programs
                             Studies of compliance
Efficiency                   Cost-effectiveness studies
Implementation               National control programs for high blood pressure
Monitoring                   Assessment of personnel and equipment
                             Effect on quality of life
Reassessment                 Re-measurement of population blood pressure levels
Key Points
• Prevention of disease is any activity which reduces the possibility of occurrence, burden
of morbidity, disability or mortality of a disease
• The process of healthcare planning includes:
o Measurement or assessment of the burden of illness
o Identification of the cause of illness
o Measurement of the effectiveness of different community interventions
o Assessment of their efficiency in terms of resources used
o Implementation of interventions
o Monitoring of activities
o Reassessment of the burden of illness to determine whether it has been altered.
• The measures of efficiency are cost effectiveness and cost benefit analysis
• Cost-effectiveness analysis looks at the ratio of financial expenditure and effectiveness
• In cost-benefit analysis, both the denominator and numerator are expressed in
monetary terms.
Evaluation
• What is the meaning and significance of PYLLs and DALYs?
• Define efficiency, cost-benefit analysis and cost-effectiveness analysis
• What are the stages in the healthcare planning cycle?
References
Switzerland: WHO
The development of these training materials was supported through funding from the President’s Emergency Plan for AIDS Relief
(PEPFAR) through the U.S. Department of Health and Human Services, Health Resources and Services Administration (HRSA)
Cooperative Agreement No. 6 U91 HA 06801, in collaboration with the U.S. Centers for Disease Control and Prevention’s Global AIDS
Programme (CDC/GAP) Tanzania. Its contents are solely the responsibility of the authors and do not necessarily represent the official
views of HRSA or CDC.
