Slide 1
Section 1-3
Critical Thinking
Course/Subject: Biostatistics and Epidemiology
Lecture/09.09.24 /BSMLS 3-A&B
Presented by: Enoch Taclan BS Biol, MSc Biol
Success in Statistics Slide 2
Success in the introductory statistics course
typically requires more common sense than
mathematical expertise.
This section is designed to illustrate how
common sense is used when we think critically
about data and statistics.
Definitions Slide 3
Sample
Is a smaller set or subset of the population
A representative sample is a subset that provides an
accurate picture of the whole population.
The data obtained from a sample of subjects may not be
exactly the same as data collected on the whole population.
However, when the sample is representative, the sample
data should be very similar to what would be found in the
whole population.
Definitions Slide 4
Biased sample
Only certain members of the population are chosen,
so that the sample systematically misrepresents the
population
Definitions Slide 5
Voluntary response sample
(or self-selected survey)
one in which the respondents themselves decide
whether to be included.
In this case, valid conclusions can be made only
about the specific group of people who agree to
participate.
Definitions Slide 6
Convenience sample
(investigator sample)
involves using respondents who are “convenient” to
the researcher.
Definitions Slide 7
Random sample
is a subset of a statistical population in which each
member of the population has an equal probability of
being chosen.
Definitions Slide 8
3 Primary types of Random sampling
Simple –
Stratified –
Systematic –
Sampling frame
A random sample requires a list of the population (the sampling frame)
The list is ordered, and every subject in the sampling frame is assigned a
unique number
Using these unique numbers/identifiers, a random number generator can
be used to randomly select the sample
Definitions Slide 9
Simple Random Sample
each subject in the population has the same chance of being
selected
Selected numbers are then matched with the subjects to
select the sample
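The matching of randomly selected numbers to subjects can be sketched in Python as follows. This is a minimal illustration, not part of the lecture: the sampling frame of 20 numbered subjects, the function name `simple_random_sample`, and the seed are all assumptions made for the example.

```python
import random

# Hypothetical sampling frame: 20 subjects, each assigned a unique number 1..20.
sampling_frame = {number: f"Subject {number}" for number in range(1, 21)}

def simple_random_sample(frame, sample_size, seed=None):
    """Draw unique numbers at random, then match them with the subjects."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    chosen_numbers = rng.sample(sorted(frame), sample_size)  # without replacement
    return [frame[number] for number in chosen_numbers]

sample = simple_random_sample(sampling_frame, sample_size=5, seed=1)
print(sample)
```

Because `random.Random.sample` draws without replacement, no subject can appear twice in the sample.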
Definitions Slide 10
Stratified random sampling
sampling technique in which the total population is divided
into homogeneous groups (strata) to complete the sampling
process.
Total sample will be representative as long as the number of
subjects sampled in each region is proportional to the
population size of the region
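Proportional allocation across strata can be sketched as below. The three regions, their sizes, and the function name `stratified_sample` are hypothetical values chosen for illustration only.

```python
import random

# Hypothetical strata (regions) and their subjects.
strata = {
    "North": [f"N{i}" for i in range(1, 501)],    # 500 subjects
    "Central": [f"C{i}" for i in range(1, 301)],  # 300 subjects
    "South": [f"S{i}" for i in range(1, 201)],    # 200 subjects
}

def stratified_sample(strata, total_sample_size, seed=None):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Stratum share of the population times the total sample size.
        n_stratum = round(total_sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, n_stratum)
    return sample

sample = stratified_sample(strata, total_sample_size=100, seed=1)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)  # {'North': 50, 'Central': 30, 'South': 20}
```

With less convenient stratum sizes the rounded allocations may not add up exactly to the requested total, so a real study would need a rule for distributing the remainder.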
Definitions Slide 11
Systematic random sampling
random sampling method that requires selecting samples
based on a system of intervals in a numbered population.
A number “s” is selected so that every s-th
subject is selected to be in the sample
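A minimal sketch of selecting every s-th subject from a numbered population is shown below. The random starting point within the first interval is an assumption (a common convention), and the population of 100 and the function name are hypothetical.

```python
import random

def systematic_sample(frame, s, seed=None):
    """Pick a random start among the first s subjects, then take every s-th one."""
    rng = random.Random(seed)
    start = rng.randrange(s)  # index 0 .. s-1
    return frame[start::s]

frame = list(range(1, 101))  # hypothetical numbered population of 100 subjects
sample = systematic_sample(frame, s=10, seed=1)
print(sample)
```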
Misuses of Statistics Slide 12
Bad Samples
Biased Sampling
• Non-Representative Samples: If a
sample doesn’t accurately reflect the
population
• Selection Bias: This occurs when
certain members of a population
Prevention: Transparency and Documentation:
Clearly documenting the samples
Misuses of Statistics Slide 13
Small Samples
Overgeneralization of Results
• Broad Conclusions from Limited
Data: Researchers or decision-makers might
take results from a small
• Ignoring Sample Limitations: Failing
to acknowledge
Misuses of Statistics Slide 14
Small Samples
Example: A study with a small sample of 15 patients
claims that a new treatment is highly effective. The
results are widely publicized, leading to excitement
and adoption of the treatment. However, when
larger studies are conducted, they fail to replicate
the results, revealing that the initial findings were
likely due to random chance rather than true
effectiveness.
Prevention: Replication and Validation: Findings from
small samples should be replicated in larger studies to
validate the results
Misuses of Statistics Slide 15
Misleading graphs
graphs are a common way that
statistics can be misused to distort the
truth or create a false impression
Slide 16
Figure 1-1
Slide 17
To correctly interpret a graph,
we should analyze the numerical
information given in the graph
instead of being misled by its
general shape.
Misuses of Statistics Slide 18
Bad Samples
Small Samples
Misleading Graphs
Pictographs
Misuses of Statistics Slide 19
Pictographs
are a visual representation of data
using pictures or symbols to represent
quantities. While they can be a helpful
tool for understanding data, they can
also be misleading if not used correctly.
Misuses of Statistics Slide 20
Slide 22
Section 1-4
Design of Experiments
Created by Tom Wegleitner, Centreville, Virginia
Major Points Slide 23
If sample data are not collected in an
appropriate way, the data may be so
completely useless that no amount of
statistical torturing can salvage them.
Randomness typically plays a critical
role in determining which data to
collect.
Slide 24
Your role as an investigator..
Will you just observe?
Will you intervene?
Slide 25
Will you observe or intervene?
What people do?
Observational
How to make changes?
Interventional
Slide 26
Choosing an appropriate study design
Observational: Case reports, Case series, Cross sectional, Case-control, Cohort
Interventional: Non-randomized controlled trials, Randomized controlled trial
Diagram labels: CORRELATIONAL STUDY, ANALYTICAL
Definitions Slide 27
Cross Sectional Study
Data are observed, measured, and collected
at one point in time.
Retrospective (or Case Control) Study
Data are collected from the past by going
back in time.
Prospective (or Longitudinal or Cohort) Study
Data are collected in the future from groups
(called cohorts) sharing common factors.
Definitions Slide 28
Confounding
occurs in an experiment when the
experimenter is not able to distinguish
between the effects of different factors
Try to plan the experiment so confounding does not occur!
Controlling Effects of Variables Slide 29
Blinding
subject does not know he or she is receiving a
treatment or placebo
Blocks
groups of subjects with similar characteristics
Completely Randomized Experimental Design
subjects are put into blocks through a process
of random selection
Rigorously Controlled Design
subjects are very carefully chosen
Replication and Sample Size Slide 30
Replication
repetition of an experiment when there are
enough subjects to recognize the differences
in different treatments
Sample Size
use a sample size that is large enough to see
the true nature of any effects and obtain that
sample using an appropriate method, such as
one based on randomness
Random Sampling Slide 31
selection so that each individual member of the population has an
equal chance of being selected
Systematic Sampling Slide 32
Select some starting point and then
select every kth element in the population
Convenience Sampling Slide 33
use results that are easy to get
Stratified Sampling Slide 34
subdivide the population into at
least two different subgroups that share the same
characteristics, then draw a sample from each
subgroup (or stratum)
Cluster Sampling Slide 35
divide the population into sections
(or clusters); randomly select some of those clusters;
choose all members from selected clusters
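The cluster procedure above (divide, randomly select clusters, take everyone in them) can be sketched as follows. The four villages and their residents are hypothetical, as is the function name `cluster_sample`.

```python
import random

# Hypothetical clusters: villages and their residents.
clusters = {
    "Village A": ["A1", "A2", "A3"],
    "Village B": ["B1", "B2"],
    "Village C": ["C1", "C2", "C3", "C4"],
    "Village D": ["D1", "D2", "D3"],
}

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters, then include every member of each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [person for name in chosen for person in clusters[name]]
    return chosen, members

chosen, members = cluster_sample(clusters, n_clusters=2, seed=1)
print(chosen, members)
```

Unlike stratified sampling, which draws some subjects from every subgroup, this draws all subjects from only some of the groups.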
Methods of Sampling Slide 36
Random
Systematic
Convenience
Stratified
Cluster
Definitions Slide 37
Sampling Error
the difference between a sample result and the true
population result; such an error results from chance
sample fluctuations
Nonsampling Error
sample data that are incorrectly collected, recorded, or
analyzed (such as by selecting a biased sample, using a
defective instrument, or copying the data incorrectly)
Recap Slide 38
In this section we have looked at:
Types of studies and experiments
Controlling the effects of variables
Randomization
Types of sampling
Sampling Errors
TYPES AND PRESENTATION
OF DATA
Course: Biostatistics and Epidemiology
Lecture/09.09.24 /BSMLS 3-A&B
Presented by: Enoch Taclan BS Biol, MSc Biol
Data
A set of values recorded on one or more observational
units, e.g. an object, a person, etc.
Types of data:
(A) Qualitative / Quantitative data
(B) Discrete / Continuous data
(C) Primary / Secondary data
(D) Nominal / Ordinal data
TYPES OF DATA
Qualitative data: Nominal, Ordinal
Quantitative data: Discrete, Continuous; Interval, Ratio
Qualitative data:
• Also called enumeration data.
• Represents a particular quality or attribute.
• There is no notion of magnitude or size of the
characteristic, as it cannot be measured.
• Expressed as counts without units of
measurement. E.g. religion, sex, blood group.
Quantitative data:
• Also called measurement data.
• These data have a magnitude.
• Can be expressed as a number with or without a unit
of measurement. E.g. height in cm, Hb in gm%, BP
in mm of Hg, weight in kg.
Discrete / Continuous data:
Discrete data: Here we always get a whole number. E.g.
number of beds in a hospital, number of malaria cases.
Continuous data: Can take any value that it is possible to measure,
including fractions. E.g. Hb level, height, weight.
Quantitative data     Qualitative data
Hb level in gm%       Anemic or non anemic
Ht in cms             Tall or short
BP in mm of Hg        Hypo, normo or hypertensive
IQ scores             Idiot, genius or normal
Primary / Secondary data:
Primary data: Obtained directly from an individual; it
gives precise information.
Secondary data: Obtained from an outside source. E.g. data
obtained from hospital records or the census.
Nominal/ Ordinal data:
Nominal data: the information or data fits into one of the
categories, but the categories cannot be ordered one above
another.
Ordinal data: here the categories can be ordered, but the
space or class interval between two categories may not be the
same.
COLLECTION OF DATA
Collect data carefully and thoroughly.
Units of measurements should be clearly defined.
Records should be correct, complete, clear, sufficiently
concise and arranged in a manner that is easy to
comprehend.
Collected data should be
• Accurate (i.e. measures the true value of what is under study)
• Valid (i.e. measures only what it is supposed to measure)
• Precise (i.e. gives adequate details of the measurement)
• Reliable (i.e. should be dependable)
SOURCES FOR COLLECTION OF
DATA
Census: Defined as “The total process of collecting,
compiling and publishing demographic, economic and
social data pertaining, at a specific time or times, to all
persons in a country or delimited territory.”
Registration of vital events: Civil registration System.
Ex: Birth Registration: Philippine Statistics Authority
(PSA): The PSA is the central authority for compiling and
managing vital statistics and provides a National
Statistics Office (NSO) certification for birth certificates.
CONTINUED..
Sample Registration System (SRS): Dual record system,
consisting of continuous enumeration of births and deaths by
an enumerator and independent survey every 6 months by an
investigator-supervisor.
CONTINUED..
Notification of diseases: Valuable source of morbidity data
such as incidence, prevalence and distribution of certain
specified diseases which are notifiable. Internationally
notifiable diseases: Cholera, Plague and Yellow fever.
Hospital Records: Primary and basic source of
information about diseases prevalent in the community.
CONTINUED..
Epidemiological Surveillance: Special surveillance
activities are conducted for diseases like Malaria, Leprosy,
TB, Filariasis, AIDS, COVID-19, etc.
Surveys: Population surveys supplement routinely
collected statistics.
Research Findings: Findings of various research or
investigations are helpful for planning and implementation
of health activities in general.
PRESENTATION OF DATA
Principles of presentation of data:
Data should be arranged in such a way that it will
stimulate interest in reader.
The data should be made sufficiently concise without
losing important details.
The data should be presented in simple form to enable
the reader to form quick impressions and to draw some
conclusion, directly or indirectly.
Should facilitate further statistical analysis .
It should define the problem and suggest its solution.
METHODS OF
PRESENTATION OF DATA
The first step in statistical analysis is to present
the data in a way that is easy to understand.
The two basic ways for data presentation are
Tabulation
Charts and diagram
RULES AND GUIDELINES FOR TABULAR
PRESENTATION
1. Table must be numbered
2. Brief and self-explanatory title must be given to each table.
3. The heading of columns and rows must be clear, sufficient,
concise and fully defined.
4. The data must be presented according to size or importance,
chronologically, alphabetically or geographically.
5. If data includes rate or proportion, mention the denominator.
6. Table should not be too large.
7. Figures needing comparison should be placed as close as
possible.
Table of Patient Outcomes by Treatment Group
Table 1: Comparison of Patient Outcomes by Treatment Group for Hypertension
Treatment Group     Number of Patients   Average Systolic Blood Pressure (mmHg)   Percentage Achieving Target BP (%)   Denominator
Group A: Drug X     150                  130                                      80%                                  150 patients
Group B: Drug Y     140                  140                                      60%                                  140 patients
Group C: Placebo    160                  150                                      40%                                  160 patients
CONTINUED..
8. The classes should be fully defined, should not lead to any
ambiguity.
9. The classes should be exhaustive i.e. should include all the
given values.
10. The classes should be mutually exclusive and non
overlapping.
11. The classes should be of equal width, i.e. the class interval should
be the same.
12. Open ended classes should be avoided as far as possible.
13. The number of classes should be neither too large nor too
small. Can be 10-20 classes.
14. Formula for number of classes (K):
K = 1 + 3.322 log10 N, where N is the total frequency
SEATWORK: FREQUENCY
DISTRIBUTION
Suppose you have data on the systolic blood pressure (BP) of 200
patients, and you want to create a frequency distribution table.
Data Range: 90 mmHg to 180 mmHg
1. Calculate the Number of Classes =
Using the formula 𝐾=1+3.322log10 N
Total frequency 𝑁=200
Round to the nearest whole number =
K=
2. Determine Class Width:
Range of data = 180 mmHg - 90 mmHg =
Number of classes K=
Class width = 90/9 =
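The two seatwork calculations can be checked with a short Python sketch. The function names are illustrative assumptions; the formula, N = 200 and the 90 to 180 mmHg range are taken from the slide.

```python
import math

def number_of_classes(total_frequency):
    """K = 1 + 3.322 * log10(N), rounded to the nearest whole number."""
    return round(1 + 3.322 * math.log10(total_frequency))

def class_width(lowest, highest, k):
    """Range of the data divided by the number of classes."""
    return (highest - lowest) / k

k = number_of_classes(200)          # 1 + 3.322 * 2.301 = 8.64, rounds to 9
width = class_width(90, 180, k)     # 90 / 9
print(k, width)  # 9 10.0
```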
FREQUENCY DISTRIBUTION
3. Define Classes:
• Class 1: 90-99 mmHg
• Class 2: 100-109 mmHg
• Class 3: 110-119 mmHg
• Class 4: 120-129 mmHg
• Class 5: 130-139 mmHg
• Class 6: 140-149 mmHg
• Class 7: 150-159 mmHg
• Class 8: 160-169 mmHg
• Class 9: 170-179 mmHg
FREQUENCY DISTRIBUTION
Systolic BP (mmHg) Number of Patients
90-99 12
100-109 20
110-119 25
120-129 30
130-139 35
140-149 28
150-159 18
160-169 18
170-179 14
TABULATION
Can be Simple or Complex depending upon the number of
measurements of single set or multiple sets of items.
Simple table :
Title: Numbers of cases of various diseases in AMCM
Disease Cases
Malaria 1100
Acute GE 248
Leptospirosis 60
Dengue 100
Total 1508
FREQUENCY DISTRIBUTION TABLE WITH
QUALITATIVE DATA:
Title: Cases of malaria in adults and children in the
months of June and July 2010 in AMCM Hospital.
Type of malaria   Jun-10 Adult   Jun-10 Child   Jul-10 Adult   Jul-10 Child   Total
P.Vivax           54             9              136            23             222
P.Falciparum      11             0              80             13             104
Mixed malaria     11             4              36             12             63
Total             76             13             252            48             389
FREQUENCY DISTRIBUTION TABLE WITH
QUANTITATIVE DATA:
Fasting blood glucose level in diabetics at the time of diagnosis
Fasting glucose level   No of diabetics
                        Male   Female   Total
120-129                 8      4        12
130-139                 4      4        8
140-149                 6      4        10
150-159                 5      5        10
160-169                 9      6        15
170-179                 9      9        18
180-189                 3      2        5
Total                   44     34       78
CHART AND DIAGRAM
Graphic presentations used to illustrate
and clarify information.
• Tables are essential in presentation of scientific data and
diagrams are complementary to summarize these tables
in an easy, attractive and simple way.
The diagram should:
Be simple
Be easy to understand
Save a lot of words
Be self explanatory
Have a clear title indicating its content
Be fully labeled
The y axis (vertical) is usually used for
frequency
VARIOUS CHARTS AND
DIAGRAMS
Bar Diagram
Histogram
Frequency polygon
Cumulative frequency curve
Scatter diagram
Line diagram
Pie diagram
BAR DIAGRAM
• Widely used, easy to prepare tool for comparing
categories of mutually exclusive discrete data.
• 3 types of bar diagram:
Simple
Multiple or compound
Component or proportional
SIMPLE BAR DIAGRAM:
[Figure: Malaria cases in AMCM Hospital in July 2010. Simple bar chart; x-axis: P.Vivax, P.Falciparum, Mixed malaria; y-axis: number of cases (0 to 120); legend: Total No cases Male.]
Multiple bar Diagram/Chart
Each observation has more than one value,
represented by a group of bars.
Examples:
Percentage of males and females in different
countries
Percentage of deaths from heart diseases in old
and young age
Mode of delivery (cesarean or vaginal) in
different female age groups
MULTIPLE OR COMPOUND
DIAGRAM
[Figure: Distribution of malaria cases in AMCM Hospital in July 2010. Multiple bar chart; x-axis: P.Vivax, P.Falciparum, Mixed malaria; y-axis: number of cases (0 to 120); legend: Male, Female; data labels shown: 102, 62, 57, 31, 29, 19.]
Component bar chart
• subdivision of a single bar to indicate the
composition of the total divided into sections
according to their relative proportion.
COMPONENT OR PROPORTIONAL BAR
DIAGRAM
[Figure: Proportion of energy intake obtained from various food stuff by poor and rich community. Component bar chart; x-axis: Poor Community, Rich Community; y-axis: % of energy obtained (0% to 100%); legend: Fats, Protein, Carbohydrate; data labels shown: 55, 80, 30, 10, 10, 15.]
HISTOGRAM:
It is very similar to the bar chart with the
difference that the rectangles or bars are
adherent (without gaps).
It is used for presenting class frequency table
(continuous data).
Each bar represents a class and its height
represents the frequency (number of cases),
its width represents the class interval.
HISTOGRAM
[Figure: Distribution of studied group according to their height. Histogram; x-axis: height in cm (classes 100-, 110-, 120-, 130-, 140-, 150-); y-axis: number of individuals (0 to 30).]
FREQUENCY POLYGON
Derived from a histogram by connecting the mid
points of the tops of the rectangles in the
histogram.
The line connecting the centers of histogram
rectangles is called frequency polygon.
We can draw polygon without rectangles so we
will get simpler form of line graph.
A special type of frequency polygon is the Normal
Distribution Curve.
Frequency polygon
[Figure: Fasting blood glucose level in diabetics at the time of diagnosis. Frequency polygon; x-axis: classes 120-129 to 180-189; y-axis: No of diabetics (0 to 20).]
SCATTER/ DOT DIAGRAM
Also called a correlation diagram, it is useful to
represent the relationship between two numeric
measurements, each observation being
represented by a point corresponding to its value
on each axis.
In negative correlation, the points will be scattered
in a downward direction, meaning that the relation
between the two studied measurements is
inverse, i.e. if one measure increases the
other decreases.
In positive correlation, the points will be
scattered in an upward direction.
[Figure: Dengue cases during monsoon in AMCM Hospital, year 2010. y-axis: dengue cases (0 to 500); data points: May, 30; June, 89; July, 304; August, 450.]
LINE DIAGRAM:
It is a diagram showing the relationship between two
numeric variables (as the scatter), but the points are
joined together to form a line (either a broken line or a
smooth curve). Used to show the trend of events with
the passage of time.
[Figure: Changes in body temperature of a patient after use of antibiotic. Line diagram; x-axis: time in hours (1 to 7); y-axis: temperature (36 to 39.5).]
PIE DIAGRAM:
Consists of a circle whose area
represents the total frequency (100%)
which is divided into segments.
Each segment represents a proportional
composition of the total frequency.
PIE DIAGRAM:
[Figure: Distribution of malaria cases in AMCM Hospital in July 2010. Pie diagram; P.Vivax 53%, P.Falciparum 32%, Mixed malaria 15%.]
CATEGORICAL VARIABLES
• Displays of Categorical
Data
– Frequencies
– Bar Graph
– Pie Chart
CATEGORICAL VARIABLES
Variable (Sex)   Frequency   Proportion
Male             609         0.61
Female           391         0.39
Total            1000        1.00
[Figure: bar graph and pie chart of the sex distribution; categories: Male, Female; y-axis ticks up to 700.]
NUMERICAL VARIABLES
Numerical summaries: Central Tendency, Spread
MEASURES OF CENTRAL
TENDENCY
• The 3 M's
– Mean
– Median
– Mode
MEASURES OF CENTRAL
TENDENCY
Sample Mean
The sample mean, $\bar{x}$, is the sum of all values in the
sample divided by the total number of observations, $n$,
in the sample.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
EXAMPLE: SAMPLE MEAN
Mean systolic blood pressure
Scenario 1:
Subjects   BP
1          120 (x1)
2          135 (x2)
3          115 (x3)
4          110 (x4)
5          105 (x5)
6          140 (x6)
Mean = (120 + 135 + 115 + 110 + 105 + 140)/6 ≈ 121
SAMPLE MEAN
• The mean is affected by extreme
observations and is not a resistant
measure.
Scenario 2:
Subjects   BP
1          120 (x1)
2          135 (x2)
3          115 (x3)
4          110 (x4)
5          105 (x5)
6          140 (x6)
7          280 (x7)
Mean = (120 + 135 + 115 + 110 + 105 + 140 + 280)/7 ≈ 144
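The effect of the extreme value on the mean can be reproduced with a short Python sketch using the blood pressure values from the two scenarios. The function name `sample_mean` is an illustrative choice.

```python
def sample_mean(values):
    """Sum of all values divided by the number of observations."""
    return sum(values) / len(values)

scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]  # same subjects plus one extreme reading

print(round(sample_mean(scenario_1)))  # 121
print(round(sample_mean(scenario_2)))  # 144
```

A single reading of 280 shifts the mean by more than 20 mmHg, which is what "not a resistant measure" means in practice.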
MEDIAN
• The sample median, M, is the number
such that “half” the values in the
sample are smaller and the other “half”
are larger.
• Use the following steps to find M.
– Sort the data (arrange in increasing order).
– Is the size of the data set n even or odd?
– If odd: M = value in the exact middle.
– If even: M = the average of the two
middle numbers.
EXAMPLE: SAMPLE MEDIAN
• Median systolic BP:
Scenario 1:
120 : 135 : 115 : 110 : 105 : 140
Sorted: 105 : 110 : 115 : 120 : 135 : 140
Median = (115 + 120)/2 = 117.5
Scenario 2:
120 : 135 : 115 : 110 : 105 : 140 : 280
Sorted: 105 : 110 : 115 : 120 : 135 : 140 : 280
Median = 120
• The median is barely affected by
extreme observations and is a
resistant measure.
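The steps for finding the median (sort first, then take the middle value or the average of the two middle values) can be sketched in Python with the same blood pressure data. The function name `sample_median` is an illustrative choice.

```python
def sample_median(values):
    """Sort the data, then take the middle value (odd n) or the
    average of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]

print(sample_median(scenario_1))  # 117.5
print(sample_median(scenario_2))  # 120
```

Sorting is the step most easily forgotten: taking the middle of the unsorted list gives a different, incorrect answer.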
MODE
• The sample mode is the value that occurs
most frequently in the sample (a data set
can have more than one mode).
• This is the only measure of center which
can also be used for categorical data.
• The population mode is the highest point
on the population distribution.
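A minimal sketch of finding the sample mode, including the case of more than one mode and the case of categorical data, is given below. Both data sets are hypothetical and the function name `sample_modes` is an illustrative choice.

```python
from collections import Counter

def sample_modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

blood_groups = ["O", "A", "O", "B", "A", "AB", "O"]  # hypothetical categorical data
print(sample_modes(blood_groups))                    # ['O']
print(sample_modes([120, 135, 120, 110, 135]))       # [120, 135]
```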
SYMMETRIC DATA DISTRIBUTION
[Figure: symmetric distribution; x-axis: Value (0 to 50); y-axis: Frequency.]
RIGHTWARD SKEWNESS OF DATA
[Figure: right-skewed distribution; x-axis: Value (0 to 50); y-axis: Frequency; markers from left to right: Mode, Median, Mean.]
LEFTWARD SKEWNESS OF DATA
[Figure: left-skewed distribution; x-axis: Value (0 to 50); y-axis: Frequency; markers from left to right: Mean, Median, Mode.]
NUMERICAL MEASURES OF
SPREAD
• Range
• Sample Variance
• Inter Quartile Range (IQR)
NUMERICAL MEASURES OF
SPREAD
Range: The range of the data set is the
difference between the highest value and
the lowest value.
Range = highest value - lowest value
– Easy to compute BUT ignores a great
deal of information.
– Obviously the range is affected by
extreme observations and is not a
resistant measure.
NUMERICAL MEASURES OF
SPREAD
• Variance: equal to the sum of squared deviations
from the sample mean divided by n - 1, where n is the
number of observations in the sample.
VARIANCE
Consider this data set : 2, 4, 4, 6, 8
Calculate the sample mean (𝑥): Sum all the data
points and divide by the number of observations (𝑛).
Subtract the mean from each data point to get the
deviation of each data point from the mean.
Square each deviation to eliminate negative values.
Sum all the squared deviations.
Divide the sum of squared deviations by 𝑛−1, where 𝑛
is the number of observations.
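The steps above can be followed one by one in a short Python sketch for the data set 2, 4, 4, 6, 8. The function name `sample_variance` is an illustrative choice.

```python
def sample_variance(values):
    n = len(values)
    mean = sum(values) / n                       # step 1: sample mean
    deviations = [x - mean for x in values]      # step 2: deviation from the mean
    squared = [d ** 2 for d in deviations]       # step 3: square each deviation
    return sum(squared) / (n - 1)                # steps 4-5: sum, divide by n - 1

data = [2, 4, 4, 6, 8]
variance = sample_variance(data)
print(round(variance, 2))  # 5.2
```

Here the mean is 4.8, the squared deviations sum to 20.8, and 20.8 / 4 = 5.2.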
NUMERICAL MEASURES OF
SPREAD
• Percentile: The pth percentile of a distribution
is the value at or below which p% of the
observations fall.
PERCENTILE
Consider this data set : 15, 20, 35, 40, 50
1. Sort the data set in ascending order.
2. Determine the rank (index) for the percentile using
the formula:
Where P is the desired percentile, and n is the
number of observations in the data set
3. If the rank is an integer, the value at that position is
the percentile.
4. If the rank is not an integer, interpolate between
the closest ranks.
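The rank-and-interpolate procedure can be sketched as below. The rank formula itself is not given in the text, so this sketch assumes one common convention, rank = (P/100) × (n + 1); other conventions exist and give slightly different values. The function name `percentile` is also an illustrative choice.

```python
def percentile(values, p):
    """Percentile by linear interpolation.

    ASSUMPTION: rank = p/100 * (n + 1), counting positions from 1.
    This is one common textbook convention, not necessarily the lecture's.
    """
    ordered = sorted(values)                 # step 1: sort ascending
    n = len(ordered)
    rank = p / 100 * (n + 1)                 # step 2: rank for the percentile
    rank = min(max(rank, 1), n)              # keep the rank inside 1..n
    lower = int(rank)                        # position at or below the rank
    fraction = rank - lower
    if fraction == 0:                        # step 3: integer rank
        return ordered[lower - 1]
    # step 4: interpolate between the two closest ranks
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [15, 20, 35, 40, 50]
print(percentile(data, 50))  # 35
print(percentile(data, 25))  # 17.5
```

For the 25th percentile the rank is 1.5, so the result lies halfway between the 1st value (15) and the 2nd value (20).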
NUMERICAL MEASURES OF
SPREAD
• The most commonly used percentiles are
the quartiles.
1st quartile Q1 = 25th percentile.
2nd quartile Q2 = 50th percentile.
3rd quartile Q3 = 75th percentile.
NUMERICAL MEASURES OF
SPREAD
Inter Quartile Range (IQR)
A simple measure of spread giving the range
covered by the middle half of the data is the
(IQR) defined below.
IQR = Q3 - Q1
The IQR is a resistant measure of spread.
INTER QUARTILE RANGE
Consider the data: {4,8,15,16,23,42}
•Sort the data set in ascending order
•Find the first quartile (Q1): This is the
25th percentile of the data set
•Find the third quartile (Q3): This is the
75th percentile of the data set
•Calculate the IQR by subtracting Q1
from Q3
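These steps can be sketched with Python's standard `statistics.quantiles` function. Quartile conventions differ between textbooks and software; this sketch uses the function's default ("exclusive") method, which is an assumption and may not match the convention used in class.

```python
import statistics

data = sorted([4, 8, 15, 16, 23, 42])        # sort in ascending order

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                                # IQR = Q3 - Q1

print(q1, q3, iqr)  # 7.0 27.75 20.75
```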
NUMERICAL MEASURES OF
SPREAD
Outliers: extreme observations that fall
well outside the overall pattern of the
distribution.
• An outlier may be the result of a
– Recording error,
– An observation from a different population,
– An unusual extreme observation
(biological diversity)
OUTLIERS
Consider the data set: {4,8,15,16,23,42}
1. Calculate Q1 and Q3 (as shown in the previous
example).
2. Calculate the IQR:
IQR = Q3 − Q1
3. Determine the lower and upper bounds for outliers:
Lower Bound = Q1 − 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
4. Identify any data points that fall below the lower bound
or above the upper bound. These points are considered
outliers.
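The 1.5 × IQR rule can be sketched as follows. As an assumption, quartiles are computed with the default method of Python's `statistics.quantiles`; a different quartile convention would shift the bounds slightly. The second data set reuses the blood pressure readings from the mean and median examples.

```python
import statistics

def find_outliers(values):
    """Return the lower bound, upper bound, and any points outside them."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # Q1 and Q3
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

print(find_outliers([4, 8, 15, 16, 23, 42]))
# (-24.125, 58.875, [])
print(find_outliers([120, 135, 115, 110, 105, 140, 280]))
# (65.0, 185.0, [280])
```

Under this convention the first data set has no outliers, while the reading of 280 mmHg in the second is flagged.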
ASSOCIATION BETWEEN
VARIABLES
• Explanatory (exposure) variable
“X”
• Response (outcome) variable
“Y”
MEASUREMENT OF
CORRELATION
CORRELATION IS NOT
ASSOCIATION
REGRESSION