Dealing with variables and hypotheses
Wojciech Fendler, M.D. Ph.D.
Department of Biostatistics and Translational Medicine
We formulate universal claims, expected to hold in the general
population, on the basis of small samples
How to do it properly?
1. The cornerstone of experiment design – random selection
of samples
2. Good experimental design
The root of the problem?
The general concept of sampling
Population
Random sample
Unknown relationship in the population: Yᵢ = β₀ + β₁Xᵢ + εᵢ
Estimated from the random sample: Yᵢ = β̂₀ + β̂₁Xᵢ + ε̂ᵢ
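The estimates β̂₀ and β̂₁ above come from ordinary least squares. A minimal sketch in Python, with made-up illustrative numbers (the variable names and data are hypothetical):

```python
# Least-squares estimation of b0, b1 in Y = b0 + b1*X + e
# from a small hypothetical sample (illustrative numbers only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mx = sum(xs) / n                       # mean of X
my = sum(ys) / n                       # mean of Y

# slope: covariance of X and Y divided by variance of X
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
     / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx                      # intercept passes through the means

print(round(b1, 2), round(b0, 2))
```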
Three things to remember in medical research
1. Anything, however unlikely, may happen due to chance
2. It is impossible to directly infer causation in the clinical setting
3. One rarely cares about the observed effect in the studied
sample, but rather about the potential to generalize the result
to the whole target population
Statistical hypotheses
• Parity
• Mutual complementarity
imply that:
• they cannot both be true
• together they cover all possible outcomes
H0: x1 = x2 (null hypothesis)
HA: x1 ≠ x2 (alternative hypothesis)
Statistical hypotheses
• Hypothesis parity: "what is not false must be true"
• The null hypothesis ('straw man' hypothesis, H0) is most often the opposite of what
the researcher believes in (the alternative hypothesis, HA)
H0: x1 = x2
HA: x1 ≠ x2
• H0 and HA must fully complement each other – no alternative options
Neyman, Pearson, Gosset
Simple example
• Is body height associated with gender?
H0: the heights of men and women are equal
HA: the heights of men and women differ
• Why is it easier to reject the null hypothesis than
find universal evidence confirming the HA?
Why is it more convenient to formulate untrue
hypotheses as the starting point?
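For a difference of means like the height example, a common test statistic is Welch's t. A minimal sketch, assuming small made-up samples (the numbers are illustrative, not real data):

```python
from statistics import mean, stdev

# Hypothetical height samples in cm (illustrative numbers, not real data)
men   = [178, 182, 175, 180, 177, 183, 179, 181]
women = [165, 168, 162, 170, 166, 164, 169, 167]

def welch_t(a, b):
    """Welch's t statistic for H0: the two group means are equal."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2       # sample variances
    return (mean(a) - mean(b)) / (va / na + vb / nb) ** 0.5

t = welch_t(men, women)
print(round(t, 2))   # a large |t| argues against H0
```

A large absolute t value (compared against the t distribution) would lead to rejecting H0; with real data one would also compute the p value.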
The truth
                        Guilty                  Not guilty
Our      Guilty         OK                      Innocent person sentenced
verdict  Not guilty     Murderer goes free      OK
H0 – Innocent until proven guilty
HA – Guilty as charged
How to formulate a statistical hypothesis ?
• In the majority of cases the null hypothesis is the opposite of what
the research theory states (the reject-support approach)
Why is it so?
• The test is aimed at rejecting the null hypothesis and accepting the
alternative one: HA = research theory, H0 = opposite of the research theory
Principles of hypothesis testing
• We can only reject H0; we can never prove that it is true
• A non-significant result does not mean accepting H0, but merely failing to
reject it under the particular conditions of the study
• Even when we have no grounds to reject H0, its non-rejection does not
mean that it is true
When do we commit errors in hypothesis
testing?
The truth
                        Guilty (H0 false)       Not guilty (H0 true)
Our      Guilty         OK                      Type 1 error – a true H0
verdict                                         rejected incorrectly
         Not guilty     Type 2 error – a false  OK
                        H0 not rejected
Type I statistical error – we reveal a non-existing (false) difference
Type II statistical error – we conceal a real (existing) difference
The meaning of errors in hypothesis testing
• Type 1 error (how to interpret):
– Rejection of a true null hypothesis – detection of a difference where there is
none
– False positive finding – discovery of an association which is purely by chance
• Type 2 error:
– Not rejecting a false null hypothesis – failure to detect a true effect
– False negative result – failure to discover the effect in a properly planned study
How do we combat the errors of hypothesis
testing?
• Type 1 error
– Plan the study properly
– Use adequate statistical tests
– …be lucky
• Type 2 error
– Plan a sufficiently large group
– Use adequate statistical tests
– …be lucky
The acceptable probability of errors (type 1 – α)
• For type 1 error it is generally assumed that a
5% probability of this error is the maximum
tolerable margin allowing the researcher to reject
the null hypothesis
– Results of statistical tests for which the probability of
obtaining the observed data under a true null hypothesis
is <0.05 are considered "statistically significant"
P value
• The probability of obtaining a result at least as extreme as the one
observed, assuming that the null hypothesis is true
• A low p value means that the results are unlikely to be explained by
chance alone
• Typically we consider p values <0.05 "statistically significant", which
translates to a sufficient weight of evidence to claim that the observed
effect is real
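The definition can be made concrete with a coin-toss example (hypothetical numbers): under a true H0 of a fair coin, the p value for observing 60 heads in 100 tosses is the probability of a result at least that extreme:

```python
from math import comb

# Hypothetical observation: 60 heads in 100 tosses of a supposedly fair coin.
# Under H0 (fair coin, P(heads) = 0.5) the two-sided p value is the
# probability of a result at least as extreme: >= 60 or <= 40 heads.
n, k = 100, 60
p_one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
p_two_sided = 2 * p_one_sided          # symmetric distribution

print(round(p_two_sided, 4))
```

Here the two-sided p value is about 0.057, just above the conventional 0.05 threshold, so the null hypothesis of a fair coin would not be rejected.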
What if we want the differences to be non-
existent?
„Absence of evidence is not
evidence of absence”
C. Sagan
The acceptable probability of errors (type 2 – β)
• For type 2 error it is generally assumed that the admissible probability of
this error is 20%
• The lower the better, but lowering it increases the required number of
samples and the cost
• 1 − β is called statistical power – the probability that the study will be able
to reject a false null hypothesis (i.e. the probability of not making a type 2
error)
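The trade-off between β and sample size can be sketched with the standard normal-approximation formula for comparing two means; the effect size and SD below are hypothetical:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation n per group for a two-sided two-sample
    comparison of means (delta = difference to detect, sd = common SD)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta  = z.inv_cdf(power)           # about 0.84 for power = 0.80
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# hypothetical design: detect a 5-unit difference when the SD is 10
print(sample_size_per_group(delta=5, sd=10))
```

Halving the detectable difference roughly quadruples the required sample size, which is why lowering β escalates cost.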
RS and AS approach to hypothesis testing
• Reject-support (RS) testing
– rejection of H0 supports research theory
• Accept-support (AS) testing
– non-rejecting H0 supports research theory
Comparison between RS and AS testing
RS testing                              AS testing
interested in rejecting H0              not interested in rejecting H0
cares about a low α                     cares about a low β
a large sample size is beneficial       a large sample size is disadvantageous
high power is important for society     a low α is important for society
Statistical inference
The process of deducing the
underlying distribution by analysis
of data. Inferential analysis provides
the properties of a variable in the
population by testing hypotheses
and deriving estimates.
Basic statistical terms
Variables (also known as features or characteristics) – values
that are monitored, measured, controlled, or manipulated by the
researcher in the course of a study
They may be classified according to a given criterion
Variables
• Independent variables – variables that may be controlled and/or
modified (manipulated) by the researcher in an experiment
• Dependent variables – variables that may only be monitored
and measured by the researcher; they cannot be manipulated or changed,
and the researcher does not affect their values
Discrete variables
• Example: responses to a five-point rating scale can only take on
the values 1, 2, 3, 4, and 5
• All qualitative variables are discrete; some quantitative variables
are discrete, such as performance rated as 1,2,3,4, or 5, or
temperature rounded to the nearest degree
• Sometimes, a variable that takes on enough discrete values can be
considered to be continuous for practical purposes
• Example: time rounded to the nearest millisecond
Continuous variables
• Examples: Length, weight, concentration, time, and the points on a
line are continuous variables
• The variable "Time to solve an anagram problem" is continuous since
it could take 2 minutes, 2.13 minutes etc. to finish a problem
• The variable "Number of correct answers on a 100 point multiple-
choice test" is not a continuous variable since it is not possible to get
54.12 problems correct
Types of variables
What are the implications of such a
classification?
Different statistical tests are employed to handle
different types of variables
Measurement scales
Measurement is the assignment of numbers to
objects or events in a systematic fashion
Four levels of measurement scales are commonly
distinguished:
• nominal
• ordinal
• interval
• ratio
Measurement scales - nominal
Nominal measurement - consists of assigning items to groups or categories
No quantitative information is conveyed and no ordering of the items is implied
Nominal scales are therefore qualitative rather than quantitative
Variables measured on a nominal scale are often referred to as categorical or
qualitative variables
Examples:
- religious preference
- race
- sex
- living in a village or a city
Measurement scales - ordinal
Ordinal measurements - are ordered in the sense that higher numbers represent
higher values, although the intervals between the numbers are not necessarily equal
Example:
- NYHA score of cardiac insufficiency
- A 4-grade rating scale:
- Grade I symptoms after exertion
- Grade II symptoms after moderate exertion
- Grade III symptoms after light exertion
- Grade IV symptoms at rest (indication for heart transplant)
- A change of 1 grade is an improvement but of different magnitude
Measurement scales - interval
Interval scale – the scale, on which the intervals between the numbers are
equal; one unit on the scale represents the same magnitude on the trait or
characteristic being measured across the whole range of the scale
An interval scale does not have a "true" zero point – it is not possible to
state how many times one value is greater than another
Interval scales continued
Examples:
- The Fahrenheit scale for temperature; equal differences on this scale represent
equal differences in temperature, but a temperature of 30 degrees is not twice as
warm as one of 15 degrees
- Anxiety scale of behaviour; if anxiety were measured on an interval scale, then a
difference between a score of 10 and a score of 11 would represent the same
difference in anxiety as would a difference between a score of 50 and a score of 51
Measurement scales - ratio
Ratio scale – like an interval scale, except that it has a true zero point
Examples:
- Kelvin scale of temperature, which has an absolute zero; the temperature of
300 Kelvin is twice as high as a temperature of 150 Kelvin
The majority of continuous variables are measured on either ratio or interval
scales
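The difference between interval and ratio scales can be checked numerically; Celsius stands in here for any interval scale whose zero is not a true zero:

```python
# Celsius is an interval scale; Kelvin, reached by a pure origin shift,
# is a ratio scale with a true (absolute) zero.
def c_to_k(celsius):
    return celsius + 273.15

# 40 degrees C is NOT "twice as warm" as 20 degrees C: on the ratio
# (Kelvin) scale the actual ratio is only about 1.07.
ratio = c_to_k(40) / c_to_k(20)
print(round(ratio, 2))
```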
When monitoring variables …
… we take care of:
Precision – the degree of repeatability of measurements in a
series
Validity/accuracy – how well a measurement reflects what it is
supposed to reflect (internal validity)
Precision and accuracy
[Figure: four targets illustrating the combinations – high precision with high
accuracy, high precision with low accuracy, low precision with high accuracy,
and low precision with low accuracy]
Precision
(repeatability)
How is it estimated? By comparing measurements in a series and assessing
their repeatability
How significant is it for an investigation? It increases the chance of
detecting real differences between groups, because it reduces within-group
variability (statistical power)
Why is it reduced? By random errors
Accuracy
(internal validity)
How is it estimated? By comparing measured values with a 'gold standard' or
reference values
How significant is it for an investigation? It increases the reliability of
the results
Why is it reduced? By systematic errors
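The two error types can be simulated; this sketch (made-up true value and error magnitudes) shows random error inflating the SD (low precision) while systematic error shifts the mean (low accuracy):

```python
import random
from statistics import mean, stdev

random.seed(1)                  # reproducible illustration
true_value = 100.0

# Random error only: measurements scatter widely but center on the truth
# (accurate but imprecise)
imprecise = [true_value + random.gauss(0, 5) for _ in range(1000)]

# Systematic error: measurements are tight but shifted away from the truth
# (precise but inaccurate)
inaccurate = [true_value + 8 + random.gauss(0, 0.5) for _ in range(1000)]

print(round(mean(imprecise), 1), round(stdev(imprecise), 1))
print(round(mean(inaccurate), 1), round(stdev(inaccurate), 1))
```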
Measures of central tendency
– which should be used and when ?
• mean (arithmetic)
• geometric mean
• harmonic mean
• median
• mode
• minimum, maximum
Mean
• The arithmetic mean - distinguished from the geometric mean or
harmonic mean
• The expected value of a random variable, which is also called the
population mean
Measures of location - Median
example
The median is the value of the middle element in an ordered series with
an odd number of elements:
10 12 13 14 18 24 25 80 89 90 120 140 145 (median = 25)
If the number of values is even, the median is the arithmetic mean of the
two middle values:
10 12 13 14 18 24 25 26 80 89 90 120 140 145 (median = 25.5)
Mode
• The value that occurs the most frequently
• Has to occur more frequently than other values
• There can be several modes, provided that they occur
more frequently than other values and equally frequently
with each other
• Examples:
– 1, 2, 3, 4, 4, 5, 6, 7, 10 (Mode 4)
– 1, 2, 3, 4, 4, 5, 5, 6, 7, 10 (Modes 4 AND 5)
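These measures can be verified with Python's statistics module, reusing the series from the median slide and the multimodal example above:

```python
from statistics import mean, median, multimode

odd  = [10, 12, 13, 14, 18, 24, 25, 80, 89, 90, 120, 140, 145]
even = [10, 12, 13, 14, 18, 24, 25, 26, 80, 89, 90, 120, 140, 145]

print(median(odd))     # odd count: the middle element
print(median(even))    # even count: mean of the two middle values
print(multimode([1, 2, 3, 4, 4, 5, 5, 6, 7, 10]))  # all equally frequent modes
```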
Statistics of central tendency
[Figure: in a symmetric distribution (with −SD and +SD marked) the mean,
median, and mode coincide; in a skewed distribution the mode stays at the
peak while the mean is pulled toward the tail, with the median between them]
Measures of dispersion
Statistical dispersion (also called statistical variability or variation) is the
spread of a variable or of a probability distribution
It is used to express the variability of a given characteristic in the studied
population
• variance
• standard deviation (SD)
• standard error (of the mean) SE(M)
• coefficient of variation
• agreement
• quantiles, quartiles
Variance – the total variability of the variable
Continuous case: if the random variable X is continuous with probability
density function p(x), the variance is
Var(X) = ∫ (x − μ)² p(x) dx,  where  μ = ∫ x p(x) dx
and where the integrals are definite integrals taken for x ranging over the range of X
The variance is thus the average squared deviation from the mean of the variable
Standard deviation
The standard deviation of a statistical population, a data set, or a probability
distribution is the square root of its variance. Standard deviation is a widely
used measure of the variability or dispersion, being algebraically more tractable
though practically less robust than the expected deviation or average absolute
deviation
The value of the SD represents the typical (average) deviation from the mean in the group
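Variance and SD can be computed directly; the sketch below reuses the series from the median slide and shows that the SD is the square root of the variance (the population and sample versions differ only in the divisor, n versus n − 1):

```python
from statistics import pvariance, pstdev, variance, stdev

data = [10, 12, 13, 14, 18, 24, 25, 80, 89, 90, 120, 140, 145]

# population versions divide by n, sample versions by n - 1
print(round(pvariance(data), 1), round(pstdev(data), 1))
print(round(variance(data), 1), round(stdev(data), 1))
```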
Standard deviation
A data set with a mean of 50 (shown in blue) and a standard deviation (σ) of 20
Normal (Gaussian) distribution
[Figure: the Gaussian curve, with the small percentages of observations
falling in the extreme tails (0.5% and 0.1%) marked]
Standard error (of the mean)
SEM = SD / √n – depends on the sample size and the SD
Used as a measure of precision when estimating the
true value of the mean
When SD and when SEM ?
• Standard deviation – measure of within-population
variability
• Standard error (of mean) – measure of imprecision in
technical replicates
Using the SEM to compare groups (controls vs cases) does not make
sense, and neither does using the SD for calibration curves
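The relationship SEM = SD / √n explains the rule above: a sketch with hypothetical replicate measurements shows the SEM describing the precision of the mean estimate, not the spread of individual values:

```python
from statistics import stdev
from math import sqrt

# hypothetical replicate measurements of the same quantity
replicates = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0]

sd  = stdev(replicates)                 # spread of single measurements
sem = sd / sqrt(len(replicates))        # SEM = SD / sqrt(n)

print(sem < sd)   # the mean is known more precisely than single values vary
```

The SEM shrinks as n grows, while the SD estimates a fixed property of the population.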
What is the median?
What will you get if you take the medians of the
lower and upper halves of an ordered data series?
Interquartile range
The interquartile range (IQR), also called the midspread or middle fifty, is a
measure of statistical dispersion, being equal to the difference between the third
and first quartiles
IQR = Q3 − Q1
The median is the corresponding measure of central tendency
Used together with the median for describing non-normally distributed samples
Interquartile range
i x[i] Quartile
1 102
2 104
3 105 Q1
4 107
5 108
6 109 Q2 (median)
7 110
8 112
9 115 Q3
10 116
11 118
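The slide's quartiles can be reproduced with statistics.quantiles; the default 'exclusive' method matches the positions marked in the table:

```python
from statistics import quantiles

x = [102, 104, 105, 107, 108, 109, 110, 112, 115, 116, 118]

q1, q2, q3 = quantiles(x, n=4)  # default method='exclusive'
print(q1, q2, q3)               # Q1, Q2 (median), Q3 as in the table
print(q3 - q1)                  # IQR = Q3 - Q1
```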
How do we read measures of central
tendency and dispersion?
[Figure: plot of total cholesterol – the middle line marks the mean,
the box marks one standard deviation around it]
How do we read measures of central
tendency and dispersion?
[Figure: box plot – the middle line marks the median,
the box marks the upper and lower quartiles]
Thank you for your attention