This page provides tips on how to carry out bootstrapping and permutation tests in R by resampling data.
The bootstrap is mainly used in estimation. The method uses resampling with replacement to generates an approximate sampling distribution of an estimate. Standard errors and confidence intervals can be calculated using this bootstrap sampling distribution.
x
. The data may be numeric
or categorical.
A single bootstrap replicate is obtained as follows. The replace option is used to indicate that sampling is carried out with replacement.
xboot <- sample(x, replace = TRUE)
Calculate the statistic of interest (for example, the mean) on the
resampled data in xboot
and store the result in a vector
created for this purpose.
z <- vector() # initialize z (do only once)
z[1] <- mean(xboot) # first bootstrap replicate estimate
mydata
is a different individual, and
each column a different variable.
To resample individuals (i.e., rows),
iboot <- sample(1:nrow(mydata), replace = TRUE))
bootdata <- mydata[iboot,]
The data frame bootdata
will contain a single bootstrap
replicate including all the variables.
Calculate the statistic of interest on the resampled data and store
the result in vector created for this purpose. For example, to calculate
the correlation between two variables x and y in
bootdata
,
z <- vector() # initialize z (do only once)
z[1] <- cor(bootdata$x, bootdata$y)
Assume that the vector z
contains a large number of
bootstrap replicate estimates of a statistic of interest. The bootstrap
standard error of the statistic is obtained as the standard
deviation of the bootstrap replicate estimates. (The most common
mistake at this stage is to calculate the standard error of the
bootstrap replicates rather than the standard deviation.)
sd(z)
The percentile method is often used to provide an approximate 95%
confidence interval for the population parameter. The percentile method
is not as accurate as the BCa method, but it is very quick
and easy to calculate. Assume that the vector z
contains a
large number of bootstrap replicate estimates of the statistic of
interest. By the percentile method, a 95% boostrap confidence interval
for the parameter is obtained as
quantile(z, probs = c(0.025,0.975))
A large number of bootstrap replicate estimates is required for an accurate confidence interval.
The bootstrap is often used to determine the bias of an estimate.
Assume that the vector z
contains a large number of
bootstrap replicate estimates of a statistic of interest. Also, assume
that the same statistic computed on the original data is stored in an
object named “estimate”. The bootstrap estimate of bias is calculated
as
mean(z) - estimate
boot
packageThe boot library, included with the standard R installation, includes useful commands for bootstrapping. Notably, it will calculate confidence intervals using the BCa method, which is more accurate than those produced by the percentile method. BCa stands for “bias corrected and accelerated”. The method corrects for estimated bias and skewness. Consult Efron and Tibshirani (1998) for details.
To begin, load the boot library
library(boot)
To use the boot package you will need to write a function to
calculate the statistic of interest. The format is illustrated below for
the sample mean, but any univariate function would be handled similarly.
We’ll call our function “boot.mean”. When you have finished writing the
script for a function you will need to cut and paste it into the command
window so that R can access it (you’ll need to do this just once in an R
session). Here, x
refers to the vector of data.
i
serves as a counter, as in your own for
loop, but it must be included as an argument in your function as
shown.
boot.mean <- function(x,i){boot.mean <- mean(x[i])}
The command boot
will automatically carry out all the
resampling and computations required. For this example, x
is the vector of original data and boot.mean
is the name of
the function we created above to calculate the statistic of interest. R
specifies the number of bootstrap replicate estimates desired.
z <- boot(x, boot.mean, R = 2000)
The resulting object (which here named z
) is a
boot object containing all the results. Use the following
additional commands to pull out the results.
print(z) # Bootstrap calculation of bias and SE
sd(z$t) # Another way to get the standard error
hist(z$t) # Histogram of boostrap replicate estimates
qqnorm(z$t) # Normal quantiles of replicate estimates
boot.ci(z, type = "bca") # 95% confidence interval using BCa
boot.ci(z, type = "perc") # Same using percentile method
Here’s how you would use the boot command If the statistic of
interest must be calculated from two (or more) variables (for example, a
correlation, regression slope, or odds ratio). Assume that there are two
variables of interest, x
and y
. If not done
already, put the two variables into a data frame (here called
mydata
),
mydata <- cbind.data.frame(x, y, stringsAsFactors = FALSE)
Then create a function to calculate the statistic of interest on the
variables in mydata
. For example, to create a function that
calculates the correlation coefficient between the two variables
x
and y
, use
boot.cor <- function(mydata,i){
x <- mydata$x[i]
y <- mydata$y[i]
boot.cor <- cor(x,y)
}
Here, i
refers to a vector of indices, which must be
included as an argument in the function and employed as shown.
Finally, pass your data frame and function to the boot command,
z <- boot(mydata, boot.cor, R = 2000)
See the previous section for a list of commands to pull results from
the boot object (here named z
).
A permutation test uses resampling and the computer to generate a null distribution for a test statistic. The test statistic is a measure of association between two variables or difference between groups, such as a slope, a correlation, or an odds ratio. Each permutation step involves randomly resampling without replacement the values of one of the two variables in the data and recalculating the test statistic in each permutation. The two variables may be categorical (character or factor), numeric, or one of each.
R has a built-in permutation procedure for a contingency test of
association when both of two variables are categorical (call them A1 and
A2). To apply it, execute the usual command for the χ2
contingency test, but set the simulate.p.value
option to
TRUE. The number of replicates in the permutation is set by the option
B
(default is 2000). Each permutation rearranges the values
in the contingency table while keeping all the row and column totals
fixed to their observed values.
chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)
x
). It doesn’t matter which of the two variables you
choose. Keep the other variable (call it y
) unchanged
(there is no benefit to resampling both variables).
Resample x
without replacement to create a new vector
(call it x1).
x1 <- sample(x, replace = FALSE)
Calculate the test statistic to measure association between
y
and the randomized variable x1. Store the result in a
vector created for this purpose. For example, to calculate the
correlation between the two variables,
z <- vector() # initialize z (do only once)
z[1] <- cor(x1, y) # first permutation result
© 2009-2024 Dolph Schluter