Lecture 1 Intro

Introduction to Biostatistics

Brian Healy, PhD


Lecture outline
• Course introduction
• STATA introduction
• Types of data
• Summary statistics
– Location: mean, median
– Spread: standard deviation, variance, range
• Graphics
– Box plot, histogram
Welcome!!!
• Brian Healy
– Assistant professor at HMS
– Research focus: Multiple sclerosis
– Most of my examples will be based on MS or the
Framingham Heart Study dataset
• Course format: Lecture/Web learning
About course
• Course objectives
– How to think statistically
– How to choose the correct statistical analysis
– Focus on concepts and how to apply the concepts
in a statistical package
– When you need more help
Course topics
• Ten lectures
– Data description
– Estimation/confidence intervals
– Hypothesis testing
– Power and sample size
– Analysis of proportions
– Nonparametric tests
– Linear regression
– Logistic regression
– Survival analysis
Homework
• Practicum: Independent assignment related to
the topic of the week, focusing on using the
statistical package
– First half is a demonstration of the technique in
STATA
– Second half is a practice problem
– Every three lectures you will answer a set of
multiple choice questions from the practicum to
demonstrate that you have completed the work
Textbooks
The main reference for this course will be Fundamentals of
Biostatistics by Rosner
Potential books of interest for additional topics
– Applied Regression Analysis and Other Multivariable
Methods by Kleinbaum et al
– Applied Longitudinal Analysis by Fitzmaurice, Laird
and Ware
– Applied Survival Analysis by Hosmer, Lemeshow and
May
– Applied Logistic Regression by Hosmer and Lemeshow
Introduction to STATA
Introduction to biostatistics
Outline for today
• Research study
• Types of data
• Measures of location
• Measures of spread
• Graphics
• Basic probability
Goals of lecture
• At the end of this lecture, you will be able to:
– Identify different types of data
– Identify the ways to summarize and visualize each
type of data
– Describe the normal distribution
Big picture
• A research study involves careful planning
prior to any data collection or analysis
• A study is only as good as the study design
• Prior to completing any data analysis, you
should inspect your data using summary
statistics and graphics
Research study
I. Study design
• Experimental question- What are you trying to
learn? How will you prove this?
• Sample selection- Who are you going to study?
II. Data collection
• What should be collected?
III. Analysis of data
• Results- Was there any effect?
• Conclusions- What does this all mean? To whom do
results apply?
How is statistics related to each stage?
I. Study design
• Experimental question- Define outcome, sources of
variability, and analysis plan
• Sample selection- Sample size, type of sample
II. Data collection
• What to collect?
III. Analysis of data
• Results- Hypothesis test
• Conclusion- Significance of effect/generalizability
Population vs. sample
• When we complete a research study, we are
usually collecting a sample of people or
animals that come from a larger population
– We usually hope that our sample is a
representative and random sample from the
population so that we can make an inference
about the population based on our sample
• This concept is critical and we will come back
to it many times (Lecture 2)
Description vs. inference
• The first step in any study is to describe the
data that have been collected
• The goal of most studies is to also use the data
to infer something about a larger group
– We will describe how we do statistical inference
later in the course
Data description
Rosner: Chapter 2
Example
• The Framingham Heart Study enrolled people
starting in 1948 and people were followed
longitudinally over time
– Discussed by Dr. Cook
• We will focus on the baseline data in this class
– Several outcome variables
– Sample size is 4434
Example II
• My research focus is data analysis of multiple
sclerosis (MS)
• At the Partners MS Center, we have a
longitudinal study of patients (CLIMB)
• A subgroup of patients in this study is given
questionnaires related to quality of life
– Patient reported outcomes
– Sample size is 252
Variables
• A variable is something that is measured in all
of the people in our sample
• Common variables in experiments are:
– Age
– Disease grade-an ordinal scale
– Presence/absence of disease
– Blood pressure
– Time to event
Types of variables
• Continuous variable: Age, blood pressure
• Dichotomous (binary) variable: Dead/alive,
Wild type/mutant
• Nominal/categorical variable: Race, group
membership
• Ordinal variable: Mild/Moderate/Severe, level
of stat knowledge
• Count variable: Number of lesions
• Time to event variable: Time to death
Ways to express data
• All data have a distribution that we would like
to describe
• Numerical summaries
– Summary statistics
– Summarize a large group of numbers with one or
two numbers
• Graphics
– Many types of graphs help provide a better
understanding of a dataset
Continuous variables
• Summary statistics
– Location
• Mean
• Median
– Variability
• Standard deviation
• Graphs
[Figure: histogram (frequency) and box plot of age (years) at examination]
Mean
• The most common measure of location is the
sample mean or average
• To calculate the mean of a group of numbers,
we add all of the numbers and divide by the
total number of numbers

• $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• The mean age in our sample is 49.9 years
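The calculation described above can be sketched in a few lines of Python. This is only an illustration: the five ages are made-up toy values (not the Framingham data), and `sample_mean` is a hypothetical helper name.

```python
# Hypothetical ages in years (toy values, not the Framingham data).
ages = [38, 45, 49, 52, 66]


def sample_mean(values):
    """Add all of the numbers and divide by the total number of numbers."""
    return sum(values) / len(values)


mean_age = sample_mean(ages)
print(mean_age)  # 50.0
```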
Standard deviation
• The most common measure of spread is the sample
standard deviation
• The sample standard deviation squared is the sample
variance
• Each describes the deviation of points from the mean
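• $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$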
• The sample standard deviation is 8.7
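As a rough sketch of the two formulas, the Python below computes the sample variance and sample standard deviation by hand, using the usual n - 1 denominator. The ages are made-up toy values, not the Framingham data; Python's `statistics.stdev` uses the same n - 1 definition and is shown only as a cross-check.

```python
import math
import statistics

# Hypothetical ages in years (toy values, not the Framingham data).
ages = [38, 45, 49, 52, 66]

n = len(ages)
mean_age = sum(ages) / n

# Sum of squared deviations from the mean, divided by n - 1.
sample_variance = sum((x - mean_age) ** 2 for x in ages) / (n - 1)

# The standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_variance)

print(sample_variance)          # 107.5
print(round(sample_sd, 2))      # 10.37
print(statistics.stdev(ages))   # same value as sample_sd, from the standard library
```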
Median
• The median is the middle number or 50th
percentile
• To calculate the median, we list the numbers in
order and we find the middle (with an even count,
we average the two middle numbers)
• Median is sometimes preferred when there is
a large skew in the data
• The median age in our sample is 49
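A minimal sketch of the median calculation, assuming made-up toy ages rather than the Framingham data; `sample_median` is a hypothetical helper name. It sorts the values and takes the middle one, averaging the two middle values when the count is even.

```python
def sample_median(values):
    """Sort the values and return the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical ages in years (toy values).
odd_ages = [52, 38, 66, 45, 49]
even_ages = [52, 38, 45, 49]

median_odd = sample_median(odd_ages)
median_even = sample_median(even_ages)
print(median_odd)   # 49
print(median_even)  # 47.0
```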
Interquartile range and range
• The interquartile range (IQR) is the distance
between the 25th and 75th percentile
• The range is the distance between the
minimum and maximum
• The IQR in our sample is 57-42=15
• The range in our sample is 70-32=38
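The same two subtractions can be sketched in Python. The five values below are a toy list, not the 4,434 real ages, and the quartile rule used here (`method="inclusive"` in `statistics.quantiles`) is an assumption; statistical packages differ slightly in how they interpolate percentiles, so results on real data may not match exactly.

```python
import statistics

# Toy list of five ages in years (not the real dataset).
ages = [32, 42, 49, 57, 70]

# Quartiles: 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")

iqr = q3 - q1                      # distance between 25th and 75th percentile
data_range = max(ages) - min(ages)  # distance between minimum and maximum

print(iqr)         # 15.0
print(data_range)  # 38
```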
STATA
• All of these statistics are easily calculated in
STATA
STATA command
. summarize age, detail

                 age (years) at examination
      Percentiles      Smallest
 1%           35             32
 5%           37             33
10%           39             33       Obs                4434
25%           42             33       Sum of Wgt.        4434

50%           49                      Mean            49.9258
                        Largest       Std. Dev.      8.676929
75%           57             69
90%           62             69       Variance        75.2891
95%           64             70       Skewness       .1923766
99%           67             70       Kurtosis       1.973587

STATA menus: Statistics\Summaries, tables, and tests\Summary and descriptive statistics\Summary statistics
Histogram
• A histogram shows the distribution of
continuous data by breaking the data into bins
and showing the frequency in each bin
• From a histogram, you can usually get a
reasonable understanding of:
– Mean and standard deviation
– Symmetry/skewness
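To make the idea of bins concrete, here is a small Python sketch that counts how many values fall in each equal-width bin and prints a text histogram. The ages, the bin width of 10 and the starting point of 30 are all illustrative assumptions.

```python
# Hypothetical ages in years (toy values).
ages = [32, 35, 41, 44, 47, 48, 52, 55, 63, 70]

bin_width = 10  # assumed bin width
start = 30      # assumed left edge of the first bin

# Count the number of observations (the frequency) in each bin.
counts = {}
for age in ages:
    left_edge = start + ((age - start) // bin_width) * bin_width
    counts[left_edge] = counts.get(left_edge, 0) + 1

# Print one row per bin; the bar length is the frequency.
for left_edge in sorted(counts):
    print(f"{left_edge}-{left_edge + bin_width - 1}: {'#' * counts[left_edge]}")
```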
Example
• On this histogram, the sample mean (49.9) and the standard deviation are marked
[Figure: histogram of age (years) at examination, with frequency on the vertical axis]

STATA command: histogram age, bin(10) frequency title(Age (years))
Skewed example
[Figure: two histograms side by side, with frequency on the vertical axis]
Symmetric distribution: histogram of age, median age = 46, mean age = 45.7
Skewed distribution: histogram of depression score, median score = 27, mean score = 28.7
Relationship between mean and median
• Symmetric distribution: mean and median are
very close and provide similar information
• Skewed distribution/outliers: mean can be
affected by high or low values more than
median
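A quick way to see this is to add one extreme value to a small sample and compare how far each summary moves. The scores below are made up for illustration only.

```python
import statistics

# Hypothetical scores (toy values), roughly symmetric.
scores = [22, 24, 25, 27, 29]
# The same scores plus one high outlier.
scores_with_outlier = scores + [65]

mean_before = statistics.mean(scores)
median_before = statistics.median(scores)
mean_after = statistics.mean(scores_with_outlier)
median_after = statistics.median(scores_with_outlier)

print(mean_before, median_before)  # 25.4 25
print(mean_after, median_after)    # 32 26.0
```

In this toy example the outlier moves the mean by 6.6 points but the median by only 1 point.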
Box plot
• A box plot also shows the distribution of continuous data
• Shows median and IQR
[Figure: box plot of age (years) at examination, labeled with the minimum, 25th percentile, median, 75th percentile and maximum]
Aside
• One ancillary benefit of computing these
summary statistics and graphics is that errors in
data entry can be more easily identified
• It is critical to ensure that your data are
"clean" before you proceed with any analysis
Dichotomous variables
• Dichotomous (binary) variables represent two
categories
• The most common way to summarize a
dichotomous variable is as a proportion
• For example, in our sample the proportion of
men is 1944/4434
• $\hat{p} = \frac{1944}{4434} = 0.44$
• For women: $\hat{p} = \frac{2490}{4434} = 0.56$
Proportion calculation in STATA
• To calculate a proportion in STATA, you can use
many commands, including
. tabulate sex

        sex |      Freq.     Percent        Cum.
          1 |      1,944       43.84       43.84
          2 |      2,490       56.16      100.00
      Total |      4,434      100.00
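The proportion itself is just a count divided by the sample size. A short Python sketch using the counts reported in the slides (1,944 men out of 4,434 people); the variable names are illustrative.

```python
# Counts taken from the slides.
n_men = 1944
n_total = 4434
n_women = n_total - n_men

# A proportion is the count in the category divided by the sample size.
p_men = n_men / n_total
p_women = n_women / n_total

print(round(p_men, 2))    # 0.44
print(round(p_women, 2))  # 0.56
```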


Nominal/categorical variable
• Nominal or categorical variables have multiple
levels, but these levels are not ordered
• Dichotomous is a special case of nominal
• Examples:
– Race
– Hair color
– Treatment group
Ordinal variable
• Ordinal variables are the most challenging type
of data in my opinion
• Two main types of ordinal data
– Order but no magnitude
• Mild, moderate, severe
– Order and information about magnitude
• Fatigue score
Time to event
• Survival time
– Median
• Graph
– Kaplan-Meier curve
[Figure: Kaplan-Meier survival estimates for sex = 1 and sex = 2, with survival probability from 0.00 to 1.00 plotted against analysis time from 0 to 8000]
Description vs. comparison
• In many instances, description of the outcome
variable is the focus
– Estimate and confidence interval
• In other instances, description is not enough
and comparison is of interest
• What do we need for comparison?
– Second variable-usually called explanatory
variable or predictor
Contingency table
• When we are investigating the relationship
between two dichotomous variables, we often
construct a 2x2 contingency table
• We will discuss these in detail in Lecture 5

          Success   Failure   Total
Male      a         b         n1
Female    c         d         n2
Total     m1        m2        N
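As an illustration of how the cells and margins of a 2x2 table are filled in, the Python sketch below tallies a handful of made-up (sex, outcome) records. The records are hypothetical; the names a, b, c, d, n1, n2, m1, m2 and N follow the layout of the table.

```python
from collections import Counter

# Hypothetical (sex, outcome) records, one per person (toy data).
records = [
    ("Male", "Success"), ("Male", "Failure"), ("Male", "Success"),
    ("Female", "Failure"), ("Female", "Success"),
    ("Female", "Failure"), ("Female", "Failure"),
]

cells = Counter(records)

# Cell counts.
a = cells[("Male", "Success")]
b = cells[("Male", "Failure")]
c = cells[("Female", "Success")]
d = cells[("Female", "Failure")]

# Row totals, column totals and grand total.
n1, n2 = a + b, c + d
m1, m2 = a + c, b + d
N = n1 + n2

print(a, b, n1)    # 2 1 3
print(c, d, n2)    # 1 3 4
print(m1, m2, N)   # 3 4 7
```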
Types of analysis-independent samples

Outcome         Explanatory    Analysis
Continuous      Dichotomous    t-test
Continuous      Continuous     Correlation, linear regression
Ordinal         Dichotomous    Wilcoxon test
Dichotomous     Dichotomous    Chi-square test, logistic regression
Dichotomous     Continuous     Logistic regression
Time to event   Dichotomous    Log-rank test
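One way to read the table is as a lookup from the pair (outcome type, explanatory type) to the candidate analyses. The Python sketch below encodes only the rows listed in the table; `suggest_analysis` is a hypothetical helper, and an empty result simply means the combination is not covered here.

```python
# Rows of the table, keyed by (outcome type, explanatory type).
ANALYSES = {
    ("continuous", "dichotomous"): ["t-test"],
    ("continuous", "continuous"): ["correlation", "linear regression"],
    ("ordinal", "dichotomous"): ["Wilcoxon test"],
    ("dichotomous", "dichotomous"): ["chi-square test", "logistic regression"],
    ("dichotomous", "continuous"): ["logistic regression"],
    ("time to event", "dichotomous"): ["log-rank test"],
}


def suggest_analysis(outcome, explanatory):
    """Return the analyses listed in the table for this pair, or an empty list."""
    key = (outcome.lower(), explanatory.lower())
    return ANALYSES.get(key, [])


print(suggest_analysis("Continuous", "Dichotomous"))     # ['t-test']
print(suggest_analysis("Time to event", "Dichotomous"))  # ['log-rank test']
```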


Normal distribution

Rosner: Chapter 5.1-5.5


Example
• One of the most important probability
distributions in biostatistics is the normal
distribution
• Several measurements follow an
approximately normal distribution
– Blood pressure
– Birth weight
Picture
• Assume blood pressure in middle aged men
has a normal distribution with mean 80 and
standard deviation 12 (from Rosner)
Properties of normal distribution
• Continuous random variable
• Range $(-\infty, \infty)$
• Two parameters
– Mean: $\mu$
– Variance: $\sigma^2$

• Symmetric
Probability
• Our focus in biostatistics will often be
calculating the probability of a specific event
• For the normal, our focus will be the
probability of being between two values or
less than/greater than a value
– For example, in the blood pressure example, we
might want to know the probability of having a
blood pressure less than 70
Normal distributions
• A great feature of the normal distribution is
that any normal distribution can be converted
into a standard normal distribution by
subtracting the mean and dividing by the
standard deviation
• $Z = \frac{X - \mu}{\sigma}$
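The conversion can be sketched as a one-line function. The mean of 80 and standard deviation of 12 are the blood pressure values assumed earlier in the lecture; `z_score` is a hypothetical helper name.

```python
def z_score(x, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return (x - mu) / sigma


MU, SIGMA = 80, 12  # blood pressure example: mean 80, standard deviation 12

z_low = z_score(68, MU, SIGMA)
z_mean = z_score(80, MU, SIGMA)
z_high = z_score(92, MU, SIGMA)

print(z_low, z_mean, z_high)  # -1.0 0.0 1.0
```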
Standard normal distribution
• Mean=0
• SD=1
• "z-score" or "z-statistic"
• To calculate probabilities, we use the area under the curve
• $P(Z \le 0) = 0.5$
Standard normal distribution
• In a normal distribution 68% of the data lie within one standard deviation of the mean
• $P(-1 \le Z \le 1) = 0.68$
• $P(Z \le -1) = 0.16$
• $P(Z \ge 1) = 0.16$
• $P(Z \le 1) = P(-1 \le Z \le 1) + P(Z \le -1) = 0.84$
[Figure: standard normal curve; 68% of the data fall in the shaded central area and 16% in each tail]
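These areas can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module, whose `cdf` method returns the lower tail probability; this is offered as an outside-Stata check, not as the course's method.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
lower_tail = standard_normal.cdf(-1)
upper_tail = 1 - standard_normal.cdf(1)
below_one = standard_normal.cdf(1)

print(round(within_one_sd, 2))  # 0.68
print(round(lower_tail, 2))     # 0.16
print(round(upper_tail, 2))     # 0.16
print(round(below_one, 2))      # 0.84
```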
Standard normal distribution
• In a normal distribution about 95% of the data lie within two standard deviations of the mean
• $P(-1.96 \le Z \le 1.96) = 0.95$
• $P(Z \le -1.96) = 0.025$
• $P(Z \ge 1.96) = 0.025$
• $P(Z \le 1.96) = P(-1.96 \le Z \le 1.96) + P(Z \le -1.96) = 0.975$
[Figure: standard normal curve; 95% of the data fall in the black central area and 2.5% in each tail]
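The value 1.96 is the point that leaves 2.5% in the upper tail, which is why "two standard deviations" is only approximate. A small Python sketch using `NormalDist.inv_cdf` (the inverse of the lower tail probability) illustrates this.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# The z value with 97.5% of the area below it (2.5% above it).
z_975 = standard_normal.inv_cdf(0.975)

# Area within exactly 1.96 and within exactly 2 standard deviations.
within_1_96 = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
within_2 = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(round(z_975, 2))        # 1.96
print(round(within_1_96, 3))  # 0.95
print(round(within_2, 3))     # 0.954
```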
Tail probabilities
• The previous examples have described the
probabilities associated with specific events
• The lower tail probabilities, $P(Z \le z)$, have
been calculated for all values $z$ of a standard
normal
• This result is useful in calculating the
probability of specific events and will help in
calculating p-values
Examples
• The probability of any other event can be calculated as the appropriate area (blue)
• $P(-1.96 \le Z \le 0) = P(Z \le 0) - P(Z \le -1.96) = 0.5 - 0.025 = 0.475$
Summary
• The probability of events from a normal
distribution can be calculated
• STATA can be used to calculate the tail
probabilities using the normal() function:
– display normal(0)
– display normal(-1.96)
– display normal(0) - normal(-1.96)
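The same arithmetic can be sketched outside Stata. In the Python below, `NormalDist().cdf` plays the role of the lower tail probability, and `prob_between` is a hypothetical helper that subtracts two lower tails to get the area between two values.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def prob_between(low, high):
    """Area under the standard normal curve between low and high."""
    return standard_normal.cdf(high) - standard_normal.cdf(low)


p_below_zero = standard_normal.cdf(0)
p_below_minus_1_96 = standard_normal.cdf(-1.96)
p_between = prob_between(-1.96, 0)

print(p_below_zero)                  # 0.5
print(round(p_below_minus_1_96, 3))  # 0.025
print(round(p_between, 3))           # 0.475
```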
Other normal distributions
• Since any normal distribution can be
converted into a standard normal distribution
by subtracting the mean and dividing by the
standard deviation, we can use this
relationship to calculate the probability of
events using the same procedure described
above
• $Z = \frac{X - \mu}{\sigma}$
Example
• Assume the distribution of blood pressure in
middle aged men is normal with 𝜇 = 80 and
𝜎 = 12
• How could we calculate the probability that a
man has a blood pressure between 68 and 92?
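– $P(68 \le X \le 92) = P\left(\frac{68 - \mu}{\sigma} \le \frac{X - \mu}{\sigma} \le \frac{92 - \mu}{\sigma}\right) =$
– $P\left(\frac{68 - 80}{12} \le Z \le \frac{92 - 80}{12}\right) =$
– $P(-1 \le Z \le 1) = 0.68$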
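As a numerical check of this example, the sketch below computes the probability two ways: directly from a normal distribution with mean 80 and standard deviation 12, and after converting the limits to z-scores. Both use Python's standard `NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

MU, SIGMA = 80, 12  # values assumed in the example
blood_pressure = NormalDist(mu=MU, sigma=SIGMA)
standard_normal = NormalDist()

# Direct calculation on the original scale.
p_direct = blood_pressure.cdf(92) - blood_pressure.cdf(68)

# Same calculation after standardizing the limits.
z_low = (68 - MU) / SIGMA   # -1.0
z_high = (92 - MU) / SIGMA  # 1.0
p_standardized = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(round(p_direct, 2))        # 0.68
print(round(p_standardized, 2))  # 0.68
```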
Picture
• The amount of area
between 68 and 92 in
this graph is exactly
the same as between
-1 and 1 on the
standard normal
Additional examples
• To calculate the probability of other events we
use the same idea
– What is the probability that a male has blood
pressure less than 56?
• $P(X \le 56) = P\left(Z \le \frac{56 - 80}{12}\right) = P(Z \le -2) = 0.023$

Calculating the tail probability like this is going to be very
similar to our approach for calculating the p-value
Additional examples
• What is the probability that a male has blood
pressure more than 100?
– $P(X \ge 100) =$
– $P\left(Z = \frac{X - 80}{12} \ge \frac{100 - 80}{12}\right) =$
– $P(Z \ge 1.67) =$
– $1 - P(Z \le 1.67) = 0.048$
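Both tail calculations from the blood pressure examples (less than 56 and more than 100) can be sketched with Python's standard `NormalDist`, again assuming mean 80 and standard deviation 12. The upper tail is one minus the lower tail.

```python
from statistics import NormalDist

blood_pressure = NormalDist(mu=80, sigma=12)  # values assumed in the examples

# Lower tail: probability of a blood pressure less than 56.
p_below_56 = blood_pressure.cdf(56)

# Upper tail: probability of a blood pressure more than 100.
p_above_100 = 1 - blood_pressure.cdf(100)

print(round(p_below_56, 3))   # 0.023
print(round(p_above_100, 3))  # 0.048
```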
Summary
• The probability of events from any normal
distribution can be calculated using the
relationship to a standard normal
• We will use this approach to calculate p-values
and confidence intervals in future classes
What we learned
• During this lecture we discussed
– Different types of data
– Ways to summarize and visualize each type of
data
– Normal distribution
