L-04 Producing Data Sampling and Design Experiment

The document discusses the fundamentals of biostatistics, focusing on data production through sampling and experimental design. It outlines various sampling methods, including voluntary, judgment, convenience, and random sampling, along with the importance of reducing bias in sample selection. Additionally, it covers observational studies, confounding variables, and different study designs such as cohort, case-control, and cross-sectional studies.

Fundamentals of Biostatistics

(PBH 611.4)

Lecture 4: Producing Data: Sampling & Design of Experiments
November 20, 2020
Producing Data: Sampling
Population, Sample and Sampling Design

• The population is the entire group of individuals that we want information about
• The sample is the part of the population we examine in order to gather information
• The sampling design describes how to obtain the sample from the population
Sampling

• In its broadest sense, sampling is a procedure by which one or more members of a population are picked from the population.

• The objective is to make certain observations upon the members of the sample and then, on the basis of these results, to draw conclusions about the characteristics of the entire population.
Some Types of Sampling

• Voluntary response samples: consist of those who choose themselves by responding to a general appeal
• Judgment samples: consist of those whom the person doing the sampling thinks best represent the population
• Convenience samples: consist of those easiest to reach
• Random samples: consist of those individuals selected by a randomizing device
Voluntary Samples: CNN on-line surveys
Bias: People have to care enough about an issue to bother replying. This sample
is probably a combination of people who hate “wasting the taxpayers’ money”
and “animal lovers.”
Probability or Random Samples

The selection of the sample unit is:

• out of the hands of the person doing the sampling, and
• out of the hands of the person being sampled

The likelihood of bias is reduced or eliminated.
Common Types of Random Sampling

• Simple random sampling

• Systematic sampling

• Stratified random sampling

• Cluster random sampling

• Multistage sampling
Selecting a Sample
The selection process:
• Assign each member of the population a sequential ID number
• Use computer-generated random numbers or a random number table
• For computer-generated numbers: generate one random number for each ID number, sort the ID numbers by their random numbers, and take IDs from the top of the list until you reach the sample size you need
• For a table: haphazardly select a starting point, read the numbers in order, and then
  • ignore numbers that are too large (they match no ID)
  • ignore a number after it appears the first time
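The computer-generated procedure described above can be sketched in a few lines of Python. This is only an illustration: the population of 20 IDs, the function name `simple_random_sample`, and the seed are assumptions made for the example, not part of the lecture.

```python
import random

def simple_random_sample(ids, n, seed=None):
    """Draw n IDs without replacement by sorting on a random key."""
    rng = random.Random(seed)
    # Generate one random number for each ID number.
    keyed = [(rng.random(), unit_id) for unit_id in ids]
    # Sort the IDs according to their random number ...
    keyed.sort()
    # ... and take IDs from the top of the list up to the sample size.
    return [unit_id for _, unit_id in keyed[:n]]

population_ids = list(range(1, 21))   # hypothetical population: IDs 1..20
sample = simple_random_sample(population_ids, n=5, seed=611)
print(sample)
```

Because every ID receives its own random key, each set of five IDs is equally likely to end up at the top of the sorted list.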
Random Number Table

Line 103: 45 46 71 17 09 77 55 80 00 95 32 86 32 94 85 82 22 69 00 56
Line 104: 52 71 13 88 89 93 07 46 02 …

Labelled list of individuals:
01 Alison     06 Fernando   11 Kate    16 Paul
02 Amy        07 George     12 Max     17 Ramon
03 Brigitte   08 Harry      13 Moe     18 Rupert
04 Darwin     09 Henry      14 Nancy   19 Tom
05 Emily      10 John       15 Ned     20 Victoria

Choose a random sample of size 5 by reading through the list of two-digit random numbers, starting with line 103 and continuing on. The first five random numbers that match numbers assigned to people make the SRS.

The first individual selected is Ramon, number 17. Then Henry (9, or “09”). That’s all we can get from line 103.

We then move on to line 104. The next three to be selected are Moe, George, and Amy (13, 7, and 2).

• Remember that 1 is 01, 2 is 02, etc.
• If you were to hit 17 again before getting five people, don’t sample Ramon twice; you just keep going.
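As a check on the table-reading rules, the walk-through above can be reproduced with a short sketch. The function name `sample_from_table` is an assumption for illustration; the digits are the two table lines shown on the slide, with line 104 limited to its visible part.

```python
def sample_from_table(digits, labels, n):
    """Read two-digit groups in order; keep the first n distinct valid labels."""
    chosen = []
    for pair in digits.split():
        # Skip numbers that match no label and numbers already used.
        if pair in labels and pair not in chosen:
            chosen.append(pair)
        if len(chosen) == n:
            break
    return [labels[p] for p in chosen]

names = ["Alison", "Amy", "Brigitte", "Darwin", "Emily", "Fernando", "George",
         "Harry", "Henry", "John", "Kate", "Max", "Moe", "Nancy", "Ned",
         "Paul", "Ramon", "Rupert", "Tom", "Victoria"]
labels = {f"{i:02d}": name for i, name in enumerate(names, start=1)}

line_103 = "45 46 71 17 09 77 55 80 00 95 32 86 32 94 85 82 22 69 00 56"
line_104 = "52 71 13 88 89 93 07 46 02"   # visible part of the line only

srs = sample_from_table(line_103 + " " + line_104, labels, n=5)
print(srs)  # ['Ramon', 'Henry', 'Moe', 'George', 'Amy']
```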
Cluster Random Sampling
• Cluster sampling is an example of 'two-stage sampling'.
• In the first stage, a sample of areas is chosen.
• In the second stage, a sample of respondents within those areas is selected.
• The population is divided into clusters of homogeneous units, usually based on geographical contiguity.
• Sampling units are groups rather than individuals.
• A sample of such clusters is then selected.
• All units from the selected clusters are studied.
Systematic Sampling

Why?
• easy
• can be very efficient depending on the
structure of the population
How?
• get a random start in the population
• sample every kth unit for some chosen
number k
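The "random start, then every kth unit" recipe can be sketched as follows. The roster of 496 IDs and k = 49 are borrowed from a later exercise in this lecture purely as illustrative values; the function name and seed are assumptions.

```python
import random

def systematic_sample(units, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # 0-based random start in [0, k)
    return units[start::k]

students = list(range(1, 497))    # hypothetical roster: student IDs 1..496
picked = systematic_sample(students, k=49, seed=1)
print(picked)
```

Only the starting point is random; once it is fixed, the rest of the sample is determined, which is why the method is easy but depends on how the population list is ordered.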
Multistage Sampling
• Divide the population into groups. Take a
simple random sample of the groups. Within
each selected group take a simple random
sample of the units
Example:
• pick a random sample of the area
code/exchanges in London (i.e. from 519-310,
519-430-2-3-4-5-8-9, etc.)
• for the selected area code/exchanges pick
several four-digit random numbers within each
exchange to complete the telephone numbers
• the resulting sample will be a (two-) multistage
sample of telephone numbers in London
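A minimal sketch of this two-stage telephone sample is shown below. The list of exchanges is a hypothetical stand-in (not a real sampling frame), and the function name and sample sizes are assumptions for illustration.

```python
import random

def two_stage_phone_sample(exchanges, n_exchanges, n_per_exchange, seed=None):
    """Stage 1: sample exchanges. Stage 2: sample four-digit suffixes in each."""
    rng = random.Random(seed)
    chosen = rng.sample(exchanges, n_exchanges)
    numbers = []
    for prefix in chosen:
        # Distinct random four-digit numbers complete the telephone numbers.
        for suffix in rng.sample(range(10000), n_per_exchange):
            numbers.append(f"{prefix}-{suffix:04d}")
    return numbers

# Hypothetical list of area code/exchange prefixes.
exchanges = ["519-310", "519-430", "519-432", "519-433"]
phone_sample = two_stage_phone_sample(exchanges, n_exchanges=2,
                                      n_per_exchange=3, seed=7)
print(phone_sample)
```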
Example: Ontario Health Survey

• carried out in 1990

• health status of the population was measured

• data were collected relating to the risk factors


associated with major causes of morbidity
and mortality in Ontario

• survey of 61,239 persons was carried out in


a stratified two-stage cluster sample by
Statistics Canada
OHS
Sample Selection

• strata: public health units –


divided into rural and urban
strata
• first stage: enumeration
areas defined by the 1986
Census of Canada and
selected randomly
• second stage: dwellings
selected randomly
• cluster: all persons in the
dwelling
Which sampling techniques are being used?

• At a U.S. college there are 120 freshmen, 90


sophomores, 110 juniors, and 80 seniors. A school
administrator selects a simple random sample of 12,
9, 11 and 8 students from freshmen, sophomores,
juniors and seniors respectively. She then interviews
all the students selected.
• From a group of 496 students, every 49th student was
selected for a sample. The starting point was selected
at random and it turns out to be the 3rd student.
• A pollster uses a computer to generate 500 random numbers and then interviews the voters corresponding to those numbers.
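In the first scenario a simple random sample is drawn separately within each class year, which is how a stratified design works. The sketch below reproduces that allocation, assuming a 10% sampling fraction in every stratum; the student identifiers, function name, and seed are made up for the example.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the given fraction within each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        n = round(len(members) * fraction)
        sample[name] = rng.sample(members, n)
    return sample

sizes = {"freshmen": 120, "sophomores": 90, "juniors": 110, "seniors": 80}
# Hypothetical student identifiers within each class year.
strata = {name: [f"{name}-{i}" for i in range(1, size + 1)]
          for name, size in sizes.items()}

chosen = stratified_sample(strata, fraction=0.10, seed=4)
print({name: len(members) for name, members in chosen.items()})
# {'freshmen': 12, 'sophomores': 9, 'juniors': 11, 'seniors': 8}
```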
Potential Problems After Random Selection

• Undercoverage

• Nonresponse

• Response bias

• Question wording
Nonresponse

Nonresponse occurs when an individual


chosen for the survey can’t be
contacted or refuses to cooperate

• cannot be contacted: not-at-homes in a


telephone survey

• refuses to cooperate: hang up on the


telephone interviewer
Response Bias

Response bias occurs when there is a


difference between the information
provided and the “appropriate” information

• sensitive questions – some may not want


to answer certain questions truthfully

• interviewer problems – the interviewer


encourages certain responses over others
Question Wording

The wording of the question can make a big


difference to the response received.
• The government should force you to pay
higher taxes. VS
• The government should increase taxes, or,
the government needs to increase taxes.
Multiple Choice Questions
You want to estimate the total number of words that are
misprints in a novel printed by a publisher on a very low
budget. In order to find a misprint in any passage of any
length, you must read through the passage and count the
number of misprinted words. The novel has 25 chapters and
the total length of the book is 476 pages. You decide to
sample at random two pages in every chapter and count the
misprints on the pages you read. Which method of sampling
did you use?
A. Convenience sampling
B. Judgement sampling
C. Simple random sampling
D. Stratified sampling
E. Two-stage sampling
Multiple Choice Questions

To assess the opinion of students at NSU about campus


safety, a reporter for the Gazette interviews 15 students
she meets walking on the campus late at night who are
willing to give their opinion. The sample is

A. all those students walking on campus late at night.


B. all students on campus with safety issues.
C. the 15 students interviewed.
D. all students approached by the reporter
E. none of the above.
Multiple Choice Question
Six students in the labelled list below are enrolled in a course:
0. Maria  1. Chinmoy  2. Shamim  3. Gion  4. Joly  5. Karim
Use the following list of random digits and the above labels to choose a simple random sample of three to be interviewed in detail about the quality of the course.
55897 24306 41842 81868 71035 09001 43367 49497 54580
The sample you obtain is
A. Gion, Joly, Karim
B. Joly, Karim, Shamim
C. Karim, Shamim, Joly
D. Chinmoy, Karim, Shamim
E. Any set of three names, since the numbers are random.
BREAK
Producing Data: Experiments
An Observational Study

• four French horn students


in the Faculty of Music
each played an “open”
note (a G) for as long as
they could
• the time in seconds that
they held the note was
recorded
• the process was repeated
three times
Confounding: A Common Problem with
Observational Studies

Example:
• want to compare pay levels between two types
of jobs: inside workers and outside workers
• the inside workers evaluated turn out to be all
(or mostly) female
• the outside workers evaluated turn out to be all
(or mostly) male
• are pay differences due to gender or to job
type?
Confounding (Lurking Variable)
■ Consider results from the following (fictitious) study:
- This study was done to investigate the association between smoking and a certain disease in male and female adults
- 210 smokers and 240 non-smokers were recruited for the study
Results for All Subjects
              Smokers   Non-Smokers   Totals
Disease            52            64      116
No Disease        158           176      334
Totals            210           240      450
Further Investigation

■ Smoking is protective against disease?

■ Most of the smokers are male and


non-smokers are female

All Subjects
          Smokers   Non-Smokers   Totals
Male          160            40      200
Female         50           200      250
Totals        210           240      450
Further Investigation

■ Smoking is protective against disease?

■ Further, most of the persons with disease


are female

All Subjects
          Disease   No Disease   Totals
Male           33          167      200
Female         83          167      250
Totals        116          334      450

[Diagram relating Disease, Smoking, and Sex]
Further Investigation
■ The original outcome of interest is DISEASE

■ The original exposure of interest is SMOKING

■ In this sample, SEX is related to both the outcome and


exposure
- This relationship is possibly impacting overall
relationship between DISEASE and SMOKING

■ How can we look at the relationship between DISEASE and


SMOKING removing any possible “interference” from SEX?
- One approach—look at DISEASE and SMOKING
relationship separately for males and females
Example
■ Is smoking related to disease in males?
Results for Males
              Smokers   Non-Smokers   Totals
Disease            29             4       33
No Disease        131            36      167
Totals            160            40      200
Example
■ Is smoking related to disease in females?
Results for Females
              Smokers   Non-Smokers   Totals
Disease            23            60       83
No Disease         27           140      167
Totals             50           200      250
Smoking, Disease, and Sex
■ A recap
- The overall (sometimes called crude or unadjusted) relationship between smoking and disease, measured by the relative risk (RR), was nearly one (risk difference nearly 0)
  - Crude: RR = (52/210) / (64/240) ≈ 0.93
- The sex-specific results showed similar positive associations between smoking and disease
  - Males: RR = (29/160) / (4/40) ≈ 1.81
  - Females: RR = (23/50) / (60/200) ≈ 1.53
- (Note: for the moment we are not considering statistical significance; we are just using estimates to illustrate the point)
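The crude and sex-specific relative risks can be recomputed directly from the three tables shown earlier. The sketch below is only an arithmetic check; the function name `relative_risk` is an assumption, and the counts are taken from the fictitious study.

```python
def relative_risk(dis_exposed, n_exposed, dis_unexposed, n_unexposed):
    """Risk of disease in the exposed divided by risk in the unexposed."""
    return (dis_exposed / n_exposed) / (dis_unexposed / n_unexposed)

# Counts from the tables: (diseased smokers, all smokers,
#                          diseased non-smokers, all non-smokers)
crude = relative_risk(52, 210, 64, 240)
males = relative_risk(29, 160, 4, 40)
females = relative_risk(23, 50, 60, 200)

print(f"crude RR  = {crude:.2f}")    # crude RR  = 0.93
print(f"male RR   = {males:.2f}")    # male RR   = 1.81
print(f"female RR = {females:.2f}")  # female RR = 1.53
```

Both stratum-specific estimates are above 1 while the combined estimate is below 1, which is the pattern the recap describes.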
Simpson’s Paradox

■ The nature of an association can change (and


even reverse direction) or disappear when data
from several groups are combined to form a
single group

■ An association between an exposure X and


a disease Y can be confounded by another
lurking (hidden) variable Z
What Is the Solution for Confounding?

■ If you DON’T KNOW what the potential confounders are, there’s not much you can do after the study is over
- Randomization is the best protection
- Randomization eliminates the potential links between the exposure of interest and potential confounders Z1, Z2, Z3

■ If you can’t randomize but KNOW what the potential confounders are, there are statistical methods to help control (adjust) for confounders
- Potential confounders must be measured as part of the study
How to Adjust for Confounding?
■ Stratify
- Look at tables separately for each stratum
- For example, by sex (males and females) or by clinic
- Take a weighted average of the stratum-specific estimates

■ For example, in the disease/smoking situation
- To get a sex-adjusted relative risk for the smoking-disease relationship, we could weight the sex-specific relative risks by the numbers of males and females
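The weighting described above can be sketched as a short calculation, using the counts from the fictitious smoking study (200 males, 250 females). This follows the simple size-weighted average on the slide; other weighting schemes, such as Mantel-Haenszel, exist and would give a somewhat different number.

```python
# Sex-specific relative risks and stratum sizes from the fictitious study.
strata = {
    "male":   {"rr": (29 / 160) / (4 / 40),  "n": 200},
    "female": {"rr": (23 / 50) / (60 / 200), "n": 250},
}

total = sum(s["n"] for s in strata.values())
# Weight each stratum-specific RR by the number of subjects in the stratum.
adjusted_rr = sum(s["rr"] * s["n"] for s in strata.values()) / total

print(f"sex-adjusted RR = {adjusted_rr:.2f}")  # sex-adjusted RR = 1.66
```

The adjusted value lies between the two stratum-specific estimates, unlike the crude relative risk computed from the combined table.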
Effect Measure Modifier
• The association between treatment and
outcome depends on a third variable
• Age can be an effect measure modifier
when looking at the association between
treatment (surgery vs drug) and outcome
(died vs survived)
• Surgery may be better for younger patients
while drug therapy is better for the older
patients.
• Better control in the analysis stage!
Observational Study: Cohort Study

• Choose a fixed number of subjects with and without the exposure
• Follow subjects for a specified time period and determine who develops the disease/outcome of interest

• Measure of association of interest


– Difference in proportion
– Relative Risk (ratio of proportions)
– Odds Ratio
Observational Study: Case-Control Study

• Choose a fixed number of cases and controls
• Look back at subjects retrospectively to see who has had the exposure of interest

• Measure of association of interest


– Odds Ratio
Example: Case-Control Study
• Researchers were interested in studying
the association between alcohol
consumption and esophageal cancer
• Esophageal cancer is a rare condition – a
prospective study would require a huge
number of subjects

• Can we measure prevalence or the relative


risk of cancer?
Caveat in Case-Control Studies
• The percentage of the population who have the disease cannot be correctly estimated from a case-control study, because the investigator fixes the numbers of cases and controls
• Hence, it is not possible to estimate the relative risk (RR) relating disease to exposure

• CANNOT calculate relative risk from a case-control study.
• CAN compute the odds ratio (OR) from case-control studies (when the disease is rare, OR ≈ RR)
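To see why the odds ratio approximates the relative risk only when the disease is rare, it may help to compute both from full cohort-style 2x2 tables, where both are available. The sketch below uses one hypothetical rare-disease table and, for contrast, the males table from the fictitious smoking study, where disease is common. In an actual case-control study only the odds ratio would be estimable.

```python
def odds_ratio(a, b, c, d):
    """a, b: exposed with/without disease; c, d: unexposed with/without."""
    return (a * d) / (b * c)

def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

rare = (30, 9970, 10, 9990)   # hypothetical counts: disease risk under 1%
common = (29, 131, 4, 36)     # males table from the smoking example

for label, table in [("rare", rare), ("common", common)]:
    print(f"{label}: OR = {odds_ratio(*table):.2f}, "
          f"RR = {relative_risk(*table):.2f}")
# rare: OR = 3.01, RR = 3.00
# common: OR = 1.99, RR = 1.81
```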
Cross Sectional Study
■ A cross-sectional study is an observational
study in which exposure and disease are
determined at the same point in time in a
given population

■ The temporal relationship between exposure


and disease cannot be determined

■ It assesses the prevalence of exposures


and/or of diseases in the population &
provides clues for further research into the
Design of a Cross-Sectional Study

Begin with: Defined Population

Then: Gather data on exposure and disease

Four Groups Are Possible:
• Exposed; Have Disease
• Exposed; Do Not Have Disease
• Not Exposed; Have Disease
• Not Exposed; Do Not Have Disease
Design and Analysis of a Cross-Sectional
Study
               Disease   No Disease
Exposed           a          b
Not Exposed       c          d
Design and Analysis of a Cross-Sectional
Study
               Disease   No Disease
Exposed           a          b
Not Exposed       c          d

Prevalence of disease in Exposed compared to Not Exposed:
    a / (a + b)   vs.   c / (c + d)

Prevalence of exposure in Disease and No Disease:
    a / (a + c)   vs.   b / (b + d)
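The two prevalence comparisons for a cross-sectional 2x2 table (cells a, b, c, d as laid out on this slide) can be sketched as a small helper. The counts passed in are hypothetical and serve only to show the arithmetic.

```python
def cross_sectional_prevalences(a, b, c, d):
    """a, b: exposed with/without disease; c, d: not exposed with/without."""
    return {
        "disease_in_exposed": a / (a + b),
        "disease_in_not_exposed": c / (c + d),
        "exposure_in_diseased": a / (a + c),
        "exposure_in_non_diseased": b / (b + d),
    }

# Hypothetical counts for illustration only.
result = cross_sectional_prevalences(a=40, b=60, c=20, d=180)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```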
BREAK
Basic Methodology in Experiments

• divide the experimental units into a


number of different groups
corresponding to the number of
treatments to be tested
• apply a different treatment to each group
• measure the units to see if there are
differences between the treatments
Example: the Salk Vaccine Trials

• 1% of people who contracted


the polio virus suffered from
the paralytic form of the
disease
• a vaccine was developed by
Jonas Salk
• tested in 1954
• in the test 440,000 children
received the vaccine and
210,000 the placebo
Three Principles of Experimental Design

1. Control the effects of lurking variables on the


response, most simply by comparing two or more
treatments.

2. Randomize—use impersonal chance to assign


subjects to treatments. This controls bias and
helps to achieve homogeneity of treatment
groups.

3. Replicate—use enough subjects in each group


to reduce chance variation in the results.
The Importance of Randomization

• Important for reducing many kinds of bias (e.g., selection bias)
• Randomization, done correctly on a large number of subjects, nearly ensures that the only difference between the groups being compared is the exposure(s) of interest
• Is randomization always possible?
Double-blind Experiments

In a double-blind experiment, neither


the subjects nor the people who
interact with them know which
treatment each subject is receiving
The Salk Vaccine Trials Again

• tested in 1954
• 440,000 children
received the vaccine and
210,000 the placebo
• double-blind: neither the
patient nor the
administering physician
knew if the patient
received the vaccine or
the placebo
Types of Experimental Designs

Completely randomized designs


• all experimental units are allocated at random
among all the treatments
Randomized block designs
• experimental units are divided into groups of
similar units called blocks; within the block the
experimental units are allocated at random
among all the treatments
Matched Pairs Designs
• Compares two treatments, pairs of subjects
are chosen as closely matched as possible
Completely randomized designs
In a completely randomized experimental design, individuals are
randomly assigned to groups, then the groups are randomly assigned
to treatments.
Another Example: Caffeine
Caffeine is a common drug that affects the central nervous system. Among the issues involved with caffeine are how it gets from the blood to the brain, and whether the presence of caffeine alters the ability of similar compounds to move across the blood-brain barrier. In an experiment, 48 lab rats were randomly assigned to eight treatments. Each treatment consisted of an arterial injection of C14-labelled adenine together with a concentration of caffeine (one of 0, 0.1, 0.5, 1, 5, 10, 25, or 50 mM, where 1 mM = 10^-3 moles per liter). Shortly after injection, the concentration of adenine in the rat brains was measured as the response.
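A completely randomized allocation like the one in the caffeine experiment can be sketched as below. The slide does not say how many rats received each concentration; equal groups of six are assumed here, and the rat labels, function name, and seed are made up for illustration.

```python
import random

def completely_randomized(units, treatments, seed=None):
    """Shuffle all units, then deal them into equal-sized treatment groups."""
    rng = random.Random(seed)
    shuffled = units[:]
    rng.shuffle(shuffled)
    per_group = len(shuffled) // len(treatments)   # assumes an even split
    return {t: shuffled[i * per_group:(i + 1) * per_group]
            for i, t in enumerate(treatments)}

rats = [f"rat{i:02d}" for i in range(1, 49)]        # hypothetical labels
concentrations_mM = [0, 0.1, 0.5, 1, 5, 10, 25, 50]
design = completely_randomized(rats, concentrations_mM, seed=14)

for conc, group in design.items():
    print(conc, group)
```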
Example of a Block Design
Test the effect of three therapies on the survival rates of
groups of men and women.

Example: Cinnamon

A total of 60 people with type 2 diabetes, 30 men and 30 women


aged 45 to 59 years participated in an experiment. The men and
women were each divided randomly into six groups. Groups 1, 2,
and 3 consumed 1, 3, or 6 grams of cinnamon daily, respectively,
and groups 4, 5, and 6 were given placebo capsules
corresponding to the number of capsules consumed for the three
levels of cinnamon. The cinnamon was consumed for 40 days
followed by a 20-day washout period at which point blood glucose,
triglyceride, total cholesterol, HDL cholesterol, and LDL cholesterol
levels were measured for each of the people. It was found that
cinnamon reduces serum glucose, triglyceride, LDL cholesterol,
and total cholesterol in people with type 2 diabetes.
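The cinnamon study randomizes separately within men and within women, which matches the randomized block idea with sex as the block. A minimal sketch follows, assuming equal groups of five per treatment within each block (the slide gives only the totals); the subject labels and treatment names are illustrative.

```python
import random

def randomized_block(blocks, treatments, seed=None):
    """Within each block, shuffle units and split them evenly across treatments."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = units[:]
        rng.shuffle(shuffled)
        per_group = len(shuffled) // len(treatments)   # assumes an even split
        for i, treatment in enumerate(treatments):
            for unit in shuffled[i * per_group:(i + 1) * per_group]:
                assignment[unit] = (block_name, treatment)
    return assignment

blocks = {
    "men": [f"M{i:02d}" for i in range(1, 31)],      # hypothetical labels
    "women": [f"W{i:02d}" for i in range(1, 31)],
}
treatments = ["cinnamon 1 g", "cinnamon 3 g", "cinnamon 6 g",
              "placebo (1 g schedule)", "placebo (3 g schedule)",
              "placebo (6 g schedule)"]
plan = randomized_block(blocks, treatments, seed=60)
print(plan["M01"], plan["W01"])
```

Blocking guarantees that every treatment group contains the same number of men and women, so sex cannot be confounded with treatment.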
Matched pairs designs (repeated measures)
Matched pairs: Choose pairs of subjects that are closely matched—
e.g., same sex, height, weight, age, and race. Within each pair,
randomly assign who will receive which treatment.

It is also possible to just use a single person and give the two
treatments to this person over time in random order. In this case, the
“matched pair” is just the same person at different points in time.

The most closely matched pair studies use identical twins.
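Random assignment within each matched pair can be sketched as a coin flip per pair. The twin labels, treatment names "A" and "B", function name, and seed below are placeholders chosen for the example.

```python
import random

def assign_within_pairs(pairs, treatments=("A", "B"), seed=None):
    """For each matched pair, randomly decide who gets which treatment."""
    rng = random.Random(seed)
    plan = []
    for first, second in pairs:
        order = list(treatments)
        rng.shuffle(order)   # equivalent to a coin flip for two treatments
        plan.append({first: order[0], second: order[1]})
    return plan

# Hypothetical matched pairs (e.g., twins).
pairs = [("twin1a", "twin1b"), ("twin2a", "twin2b"), ("twin3a", "twin3b")]
pair_plan = assign_within_pairs(pairs, seed=2020)
print(pair_plan)
```

For the single-subject version, the same shuffle could decide the order in which one person receives the two treatments.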
Recognizing a Matched Pairs Design

• two measurements on the same number of experimental units
• NB: the experimental units are either:
– the same for the two measurements (before and after), or
– chosen by the experimenter to be as similar as possible (twin studies)
• i.e., if no treatments were applied, there should be a high positive correlation between the two measurements
