This page provides tips on how to carry out bootstrapping and permutation tests in R by resampling data.
The bootstrap is mainly used in estimation. The method uses resampling with replacement to generates an approximate sampling distribution of an estimate. Standard errors and confidence intervals can be calculated using this bootstrap sampling distribution.
x
. The data may be numeric or categorical.
A single bootstrap replicate is obtained as follows. The replace option is used to indicate that sampling is carried out with replacement.
xboot <- sample(x, replace = TRUE)
Calculate the statistic of interest (for example, the mean) on the resampled data in xboot
and store the result in a vector created for this purpose.
z <- vector() # initialize z (do only once)
z[1] <- mean(xboot) # first bootstrap replicate estimate
mydata
is a different individual, and each column a different variable.
To resample individuals (i.e., rows),
iboot <- sample(1:nrow(mydata), replace = TRUE))
bootdata <- mydata[iboot,]
The data frame bootdata
will contain a single bootstrap replicate including all the variables.
Calculate the statistic of interest on the resampled data and store the result in vector created for this purpose. For example, to calculate the correlation between two variables x and y in bootdata
,
z <- vector() # initialize z (do only once)
z[1] <- cor(bootdata$x, bootdata$y)
Assume that the vector z
contains a large number of bootstrap replicate estimates of a statistic of interest. The bootstrap standard error of the statistic is obtained as the standard deviation of the bootstrap replicate estimates. (The most common mistake at this stage is to calculate the standard error of the bootstrap replicates rather than the standard deviation.)
sd(z)
The percentile method is often used to provide an approximate 95% confidence interval for the population parameter. The percentile method is not as accurate as the BCa method, but it is very quick and easy to calculate. Assume that the vector z
contains a large number of bootstrap replicate estimates of the statistic of interest. By the percentile method, a 95% boostrap confidence interval for the parameter is obtained as
quantile(z, probs = c(0.025,0.975))
A large number of bootstrap replicate estimates is required for an accurate confidence interval.
The bootstrap is often used to determine the bias of an estimate. Assume that the vector z
contains a large number of bootstrap replicate estimates of a statistic of interest. Also, assume that the same statistic computed on the original data is stored in an object named “estimate”. The bootstrap estimate of bias is calculated as
mean(z) - estimate
boot
packageThe boot library, included with the standard R installation, includes useful commands for bootstrapping. Notably, it will calculate confidence intervals using the BCa method, which is more accurate than those produced by the percentile method. BCa stands for “bias corrected and accelerated”. The method corrects for estimated bias and skewness. Consult Efron and Tibshirani (1998) for details.
To begin, load the boot library
library(boot)
To use the boot package you will need to write a function to calculate the statistic of interest. The format is illustrated below for the sample mean, but any univariate function would be handled similarly. We’ll call our function “boot.mean”. When you have finished writing the script for a function you will need to cut and paste it into the command window so that R can access it (you’ll need to do this just once in an R session). Here, x
refers to the vector of data. i
serves as a counter, as in your own for
loop, but it must be included as an argument in your function as shown.
boot.mean <- function(x,i){boot.mean <- mean(x[i])}
The command boot
will automatically carry out all the resampling and computations required. For this example, x
is the vector of original data and boot.mean
is the name of the function we created above to calculate the statistic of interest. R specifies the number of bootstrap replicate estimates desired.
z <- boot(x, boot.mean, R = 2000)
The resulting object (which here named z
) is a boot object containing all the results. Use the following additional commands to pull out the results.
print(z) # Bootstrap calculation of bias and SE
sd(z$t) # Another way to get the standard error
hist(z$t) # Histogram of boostrap replicate estimates
qqnorm(z$t) # Normal quantiles of replicate estimates
boot.ci(z, type = "bca") # 95% confidence interval using BCa
boot.ci(z, type = "perc") # Same using percentile method
Here’s how you would use the boot command If the statistic of interest must be calculated from two (or more) variables (for example, a correlation, regression slope, or odds ratio). Assume that there are two variables of interest, x
and y
. If not done already, put the two variables into a data frame (here called mydata
),
mydata <- cbind.data.frame(x, y, stringsAsFactors = FALSE)
Then create a function to calculate the statistic of interest on the variables in mydata
. For example, to create a function that calculates the correlation coefficient between the two variables x
and y
, use
boot.cor <- function(mydata,i){
x <- mydata$x[i]
y <- mydata$y[i]
boot.cor <- cor(x,y)
}
Here, i
refers to a vector of indices, which must be included as an argument in the function and employed as shown.
Finally, pass your data frame and function to the boot command,
z <- boot(mydata, boot.cor, R = 2000)
See the previous section for a list of commands to pull results from the boot object (here named z
).
A permutation test uses resampling and the computer to generate a null distribution for a test statistic. The test statistic is a measure of association between two variables or difference between groups, such as a slope, a correlation, or an odds ratio. Each permutation step involves randomly resampling without replacement the values of one of the two variables in the data and recalculating the test statistic in each permutation. The two variables may be categorical (character or factor), numeric, or one of each.
R has a built-in permutation procedure for a contingency test of association when both of two variables are categorical (call them A1 and A2). To apply it, execute the usual command for the χ2 contingency test, but set the simulate.p.value
option to TRUE. The number of replicates in the permutation is set by the option B
(default is 2000). Each permutation rearranges the values in the contingency table while keeping all the row and column totals fixed to their observed values.
chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)
x
). It doesn’t matter which of the two variables you choose. Keep the other variable (call it y
) unchanged (there is no benefit to resampling both variables).
Resample x
without replacement to create a new vector (call it x1).
x1 <- sample(x, replace = FALSE)
Calculate the test statistic to measure association between y
and the randomized variable x1. Store the result in a vector created for this purpose. For example, to calculate the correlation between the two variables,
z <- vector() # initialize z (do only once)
z[1] <- cor(x1, y) # first permutation result
© 2009-2025 Dolph Schluter