
EC 306

KI 2024

R – Session
Customising computer-intensive procedures

The programming capabilities of R can be combined with available statistical functions to carry out (1) non-
traditional statistical analyses and (2) simulations that allow you to explore the suitability of different
statistical techniques for specific data-analysis situations.

1. Bootstrapping
Resampling to construct uncertainty measures, e.g., 95% Confidence Interval, around parameter estimates

What do you do when you wish to present confidence intervals around a measure, such as the mean of a
sample, but the sample is not normally distributed? What do you do when you fit statistical models to your
data but the model assumptions are violated, so that, while the parameter estimates are okay, distribution-based
uncertainty measures (e.g., standard errors of model coefficients, confidence intervals of model
coefficients) cannot be trusted? What if you try out all the different models you have at hand but the
assumptions are still violated? The most robust option available (i.e., the one that is likely to give you the
most robust uncertainty measures) is non-parametric bootstrapping (for p values, you would use permutation
tests, described in the next section).

The idea behind bootstrapping is that, if you have no other information about a population, the distribution
of values in your sample is the best guide to the distribution in the population. The main assumption is that
your sample is a random sample from the population and represents it well. To get confidence limits around
the statistic that you calculate from your sample, you resample the sample (with replacement). This
approximates what would happen if the population itself were resampled. (The population is to the sample
what the sample is to the bootstrap samples.)

The basic procedure is to generate N samples (with replacement) from the original dataset; perform the
analysis of interest on each sample and store the parameter estimates of interest; and then find the 95%
confidence intervals from the distribution of the parameter estimates.
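
As a minimal sketch of this procedure for the mean of a sample (not the course script; x and n.boot below are placeholder names used only for illustration):

x <- rexp(50)                               ## an example skewed sample
n.boot <- 10000                             ## number of bootstrap samples
boot.means <- numeric(length=n.boot)
for (b in 1:n.boot){
  resample <- sample(x, size=length(x), replace=TRUE)
  boot.means[b] <- mean(resample)           ## store the statistic of interest
}
quantile(boot.means, probs=c(0.025, 0.975)) ## percentile 95% confidence interval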

The procedure described above is a non-parametric bootstrapping procedure, which makes the fewest
assumptions about the nature of the data. There are other bootstrapping procedures that make assumptions
about the nature of stochasticity in the data. REF explains these different procedures well.
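
For example, one such procedure is a parametric bootstrap for a linear model, in which new responses are simulated from the fitted model rather than resampling the data themselves. A rough sketch, assuming normally distributed errors (x, y, and n.boot are placeholder names):

x <- runif(30); y <- 1 + 0.5*x + rnorm(30)   ## example data
fit <- lm(y~x)
n.boot <- 10000
boot.slopes <- numeric(length=n.boot)
for (b in 1:n.boot){
  y.sim <- fitted(fit) + rnorm(length(y), mean=0, sd=sigma(fit)) ## simulate from fitted model
  boot.slopes[b] <- coef(lm(y.sim~x))[2]     ## slope from refitted model
}
quantile(boot.slopes, probs=c(0.025, 0.975)) ## parametric-bootstrap 95% CI for the slope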

The R script “randomisation_non-parametric_bootstrap_code.R” has examples of calculating bootstrapped
confidence intervals for (1) the mean of a data set, (2) model coefficients for a linear model with continuous
predictors, and (3) model coefficients for a linear model with continuous and categorical predictors.

Calculating bootstrapped confidence intervals for coefficients in a mixed model requires that resampling be
carried out with the grouped structure (random effects) of the data in mind. If you are interested in this,
please get in touch and I can send you example code.

2. Permutation (Randomisation) tests
Statistical tests of hypotheses by creating a null distribution of the test statistic

We often have to deal with data that do not conform to the assumptions of parametric tests. Suppose we are
interested in the effect of some variable X on a trait Y, and the assumptions of ordinary regression analysis
are likely to be violated, increasing the Type I error rate. Suppose also that we are unsure of the underlying
distribution of Y, either because there is not enough information in our data or for theoretical reasons, so
parametric tests other than ordinary regression are also ruled out.


So, we turn to randomisation tests: our overall goal is to create a distribution of the test statistic (e.g., slope
of the line describing the relationship between X and Y) under the null hypothesis that there is no effect of X
on Y. We can then compare our observed test statistic with the null distribution we have created to assess its
significance.

To create a null distribution of the test statistic, we shuffle the Y values across the X values. This breaks up
any pattern between X and Y, so any apparent relationship in a shuffled data set can only have arisen by
chance. If we repeat this procedure many times, we get a distribution of the test statistic under the null
hypothesis of no effect. We can then compare the actual test statistic from the original data with this null
distribution to see how likely we are to obtain our observed statistic under the null hypothesis (i.e., what is
the probability of getting the observed statistic, or a more extreme one, under the null hypothesis?).
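
As a minimal sketch of such a permutation test for a regression slope (x, y, and n.perm are placeholder names; the script mentioned below contains fuller examples):

x <- runif(30)                        ## example predictor
y <- 0.5*x + rexp(30)                 ## example (skewed) response
obs.slope <- coef(lm(y~x))[2]         ## observed test statistic
n.perm <- 10000
null.slopes <- numeric(length=n.perm)
for (p in 1:n.perm){
  y.shuffled <- sample(y)             ## shuffle Y across X values
  null.slopes[p] <- coef(lm(y.shuffled~x))[2]
}
## two-sided p value: prop. of null slopes at least as extreme as the observed slope
mean(abs(null.slopes) >= abs(obs.slope))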

The R script “randomisation_non-parametric_bootstrap_code.R” has examples of performing permutation
tests for (1) comparing two means and (2) model coefficients for a linear model.

3. Monte-Carlo simulations
You may sometimes wish to check whether the use of a particular statistical procedure is appropriate when
analysing your data. For example, does an ordinary regression perform well even when you expect error
variance to be strongly unequal? Can one use ordinary regression if the response variable is highly skewed?

To answer these questions, this would be the general procedure:


1. Generate artificial data according to the process you think might be occurring
2. Run your chosen analysis (or analyses) using these data
3. Repeat steps 1 and 2 many times to measure
a. Type I error (proportion of times P≤0.05 when the null hypothesis is true)
b. Statistical power (proportion of times P≤0.05 when the null hypothesis is false)
c. Confidence interval around parameter estimates to check precision

We’ll address the first question above to illustrate this approach. Both questions are simple, but the same
procedure can be used to investigate much more complex situations.

Does ordinary regression perform well when error variance is strongly unequal?
One of the assumptions of simple regression analysis is that error variance is equal at all points along the
regression line. In reality, error variance often increases with the predictor variable and/or with the dependent
variable. Here, we will investigate a different cause of unequal error variance: that created by different
sample sizes.

Suppose we were interested in the relationship between the number of helpers in a group and offspring
growth rate in cooperatively breeding babblers. As an estimate of the growth rate of offspring in a group we
take an average of all offspring measured in each group. However, groups differ in the number of offspring
measured, say from 1 to 7. Clearly, an average taken from 7 offspring is more reliable (is subject to less
uncertainty or error) as an estimate for that group than is an average taken from 2 offspring. In other words,
given that there is a relationship between y and x, a y-value calculated by averaging a large sample is likely
to deviate less from the predicted line than is a y-value calculated by averaging a small sample. Simple
regression, however, assumes that there is no such systematic change in error variance. Are there serious
consequences when this assumption is violated?

Our aim is to simulate the process that we think might underlie our data and test different potential statistical
techniques to see which one is most appropriate for our data situation. In the current example, we will
generate data in which the sample size for different observations varies, analyse these data using two
techniques (simple regression and weighted regression), and see which of these two techniques is better at
recovering the original simulated relationship. Imagine that the true intercept is 2 g/hr and each additional
helper adds 0.3 g/hr to this (so the slope is 0.3 g/hr per helper).


1. Generate data
helper <- c(1,2,2,3,3,5,6,9,10,12)   ## suppose these are the no. of helpers
s.size <- c(1,1,1,1,2,3,4,6,7,7)     ## no. of offspring measured in each group

To generate averages from different sample sizes, we write a small function containing a loop:


helper.fun <- function(int, slope, helper, s.size){
  growth <- numeric(length=length(helper))
  for (i in 1:length(helper)){
    growth[i] <- mean(rnorm(s.size[i], int+slope*helper[i], sd=2))
  }
  growth
}

growth <- helper.fun(int=2, slope=0.3, helper=helper, s.size=s.size)
## these are our growth data
plot(helper, growth)

There are two ways we could analyse these data: by simple regression or by weighted regression. Weighted
regression can account for changes in error variance if we specify a series of weights. In this example, we
want to weight by sample size because we have more confidence in averages from larger samples.

2. Analyse data
summary(lm(growth~helper)) ## simple regression
summary(lm(growth~helper, weights=s.size)) ## weighted regression

Write a function to generate and analyse data (combine 1 and 2)


unequal.var.fun <- function(int, slope, helper, s.size, wts){
  growth <- helper.fun(int=int, slope=slope, helper=helper, s.size=s.size)
  simple.p <- coef(summary(lm(growth~helper)))[8]            ## P from ordinary regr.
  wtd.p <- coef(summary(lm(growth~helper, weights=wts)))[8]  ## P from wtd regr.
  simple.b <- coef(summary(lm(growth~helper)))[2]            ## slope from ordinary regr.
  wtd.b <- coef(summary(lm(growth~helper, weights=wts)))[2]  ## slope from wtd regr.
  c(simple.p, wtd.p, simple.b, wtd.b)                        ## output of function
}

3. Iterate 1 and 2 many times


iter <- 1000
pval.simple <- numeric(length=iter)   ## create vectors to store p values
pval.wtd <- numeric(length=iter)
slope.simple <- numeric(length=iter)  ## create vectors to store slope values
slope.wtd <- numeric(length=iter)

for (n in 1:iter){
  temp <- unequal.var.fun(int=2, slope=0.3, helper=helper,
                          s.size=s.size, wts=s.size)
  pval.simple[n] <- temp[1]
  pval.wtd[n] <- temp[2]
  slope.simple[n] <- temp[3]
  slope.wtd[n] <- temp[4]
}

length(pval.simple[pval.simple<=0.05])/iter   ## prop. of runs where P<=0.05: simple
length(pval.wtd[pval.wtd<=0.05])/iter         ## prop. of runs where P<=0.05: wtd
quantile(slope.simple, probs=c(0.025, 0.975)) ## 95% CI of the slope: simple
quantile(slope.wtd, probs=c(0.025, 0.975))    ## 95% CI of the slope: wtd

What do you conclude from these simulations? Will you use weighted
regression or simple regression to analyse your data?
