Resampling methods
This page provides tips on how to carry out bootstrapping and randomization tests in R by resampling data.

Bootstrap resampling
The bootstrap is mainly used in estimation. The method uses resampling with replacement to generate an approximate sampling distribution of an estimate. Standard errors and confidence intervals can be calculated using this bootstrap sampling distribution.

Resample data
Let's assume that the data are a sample of measurements for a single variable stored in a vector "x". The data may be numeric or categorical.
1. A single bootstrap replicate is obtained as follows. The replace option is used to indicate that sampling is carried out with replacement.
xboot <- sample(x, replace=TRUE)
2. Calculate the statistic of interest (for example, the mean) on the resampled data in "xboot" and store the result in a vector created for this purpose.
z <- vector() # initialize z (do only once)
z[1] <- mean(xboot) # first bootstrap replicate estimate
3. Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for subsequent analysis. A complete loop is sketched below.
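Putting steps (1) through (3) together, a minimal sketch of the full loop might look like the following. Here "B", the number of replicates, is a name introduced for illustration, and the data are assumed to be in the vector "x".
B <- 2000                                 # number of bootstrap replicates (arbitrary choice)
z <- vector()                             # initialize z (do only once)
for(i in 1:B){
    xboot <- sample(x, replace = TRUE)    # resample the data with replacement
    z[i] <- mean(xboot)                   # store the bootstrap replicate estimate
}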
If the data include two or more variables stored in the rows of a data frame (here called "mydata"), resample whole individuals (i.e., rows) so that measurements from the same individual stay together.
1. A single bootstrap replicate is obtained by resampling row numbers with replacement and using them to index the rows of the data frame.
iboot <- sample(1:nrow(mydata), replace = TRUE)
bootdata <- mydata[iboot,]
The data frame "bootdata" will contain a single bootstrap replicate including all the variables. - Calculate the statistic of interest
on the resampled data and store the result in vector created
for
this purpose. For example, to calculate the correlation between two variables x and y in "bootdata",
z <- vector() # initialize z (do only once)
z[1] <- cor(bootdata$x, bootdata$y)
3. Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for subsequent analysis. A complete loop is sketched below.
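As before, the three steps can be combined into a single loop. A minimal sketch, assuming the data frame "mydata" contains the variables x and y, and with "B" again introduced as the number of replicates:
B <- 2000                                            # number of bootstrap replicates
z <- vector()                                        # initialize z (do only once)
for(i in 1:B){
    iboot <- sample(1:nrow(mydata), replace = TRUE)  # resample row numbers
    bootdata <- mydata[iboot, ]                      # single bootstrap replicate
    z[i] <- cor(bootdata$x, bootdata$y)              # store the replicate estimate
}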
Bootstrap standard error
Assume that the vector "z" contains a large number of bootstrap replicate estimates of a statistic of interest. The bootstrap standard error of the statistic is obtained as the standard deviation of the bootstrap replicate estimates. (The most common mistake at this stage is to calculate the standard error of the bootstrap replicates rather than the standard deviation.)
sd(z)
Confidence interval approximation using the percentile method
The percentile method is often used to provide an approximate 95% confidence interval for the population parameter. The percentile method is not as accurate as the BCa method, but it is very quick and easy to calculate. Assume that the vector "z" contains a large number of bootstrap replicate estimates of the statistic of interest. By the percentile method, a 95% bootstrap confidence interval for the parameter is obtained as
quantile(z, probs = c(0.025, 0.975))
A large number of bootstrap replicate estimates is required for an accurate confidence interval.
Bootstrap estimate of bias
The bootstrap is often used to determine the bias of an estimate. Assume that the vector "z" contains a large number of bootstrap replicate estimates of a statistic of interest. Also, assume that the same statistic computed on the original data is stored in an object named "estimate". The bootstrap estimate of bias is calculated as
mean(z) - estimate
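The three quantities above can be computed together once the resampling loop has run. A brief sketch, assuming "z" holds bootstrap replicate estimates of the mean and "x" holds the original data:
estimate <- mean(x)                     # statistic computed on the original data
sd(z)                                   # bootstrap standard error
quantile(z, probs = c(0.025, 0.975))    # approximate 95% confidence interval
mean(z) - estimate                      # bootstrap estimate of bias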
Using the boot package
The boot library, included with the standard R installation, provides useful commands for bootstrapping. Notably, it will calculate confidence intervals using the BCa method, which is more accurate than those produced by the percentile method. BCa stands for "bias corrected and accelerated". The method corrects for estimated bias and skewness. Consult Efron and Tibshirani (1998) for details.

To begin, load the boot library
library(boot)
Single variable
To use the boot package you will need to write a function to calculate the statistic of interest. The format is illustrated below for the sample mean, but any univariate function would be handled similarly. We'll call our function "boot.mean". When you have finished writing the script for a function you will need to cut and paste it into the command window so that R can access it (you'll need to do this just once in an R session). Here, "x" refers to the vector of data. "i" refers to a vector of indices, which must be included as an argument in the function and employed as shown.
boot.mean <- function(x, i){
    boot.mean <- mean(x[i])
}
The command "boot" will automatically carry out all the resampling and computations required. For this example, "x" is the vector of original data and "boot.mean" is the name of the function we created above to calculate the statistic of interest. R specifies the number of bootstrap replicate estimates desired.
z <- boot(x, boot.mean, R=2000)
The resulting object (here named "z") is a boot object containing all the results. Use the following additional commands to pull out the results.
print(z) # Bootstrap calculation of bias and SE
sd(z$t) # Another way to get the standard error
hist(z$t) # Histogram of bootstrap replicate estimates
qqnorm(z$t) # Normal quantiles of replicate estimates
boot.ci(z,type="bca") # 95% confidence interval using BCa
boot.ci(z,type="perc") # Same using percentile method
Two or more variables
Here's how you would use the boot command if the statistic of interest must be calculated from two (or more) variables (for example, a correlation, regression slope, or odds ratio). Assume that there are two variables of interest, "x" and "y". If not done already, put the two variables into a data frame (here called "mydata"),
mydata <- cbind.data.frame(x, y, stringsAsFactors=FALSE)
Then create a function to calculate the statistic of interest on the variables in "mydata". For example, to create a function that calculates the correlation coefficient between the two variables "x" and "y", use
boot.cor <- function(mydata, i){
    x <- mydata$x[i]
    y <- mydata$y[i]
    boot.cor <- cor(x, y)
}
Here, "i" refers to a vector of indices, which must be included as an argument in the function and employed as shown.
Finally, pass your data frame and function to the boot command,
z <- boot(mydata, boot.cor, R=2000)
See the previous section for a list of commands to pull results from the boot object (here named "z").
Randomization test
A randomization test uses resampling and the computer to generate a null distribution for a test statistic. The test statistic is a measure of association between two variables or difference between groups. Each randomization step involves randomly resampling without replacement the values of one of the two variables in the data. The two variables may be categorical (character or factor), numeric, or one of each.

Both variables categorical
R has a built-in randomization procedure for a contingency test of association when both of two variables are categorical (call them A1 and A2). To apply it, execute the usual command for the χ2 contingency test, but set the "simulate.p.value" option to TRUE. The number of replicates in the randomization is set by the option "B" (default is 2000). Each randomization rearranges the values in the contingency table while keeping all the row and column totals fixed to their observed values.
chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)
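For example, with two made-up categorical vectors (the data below are invented purely for illustration):
A1 <- rep(c("treated", "control"), c(12, 12))         # hypothetical treatment variable
A2 <- sample(c("lived", "died"), 24, replace = TRUE)  # hypothetical outcome variable
chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)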
Data are numeric
If one or both of the variables is numeric, then you will need to create a short loop to carry out the resampling necessary for the randomization test. Choose one of the two variables to resample (call it "x"). It doesn't matter which of the two variables you choose. Keep the other variable (call it "y") unchanged (there is no benefit to resampling both variables).
1. Resample "x" without replacement to create a new vector (call it x1).
x1 <- sample(x, replace = FALSE)
2. Calculate the test statistic to measure association between y and the randomized variable x1. Store the result in a vector created for this purpose. For example, to calculate the correlation between the two variables,
z <- vector() # initialize z (do only once)
z[1] <- cor(x1, y) # first randomization result
3. Repeat steps (1) and (2) many times. The result will be a large collection of replicates representing the null distribution of your test statistic.
4. Use this null distribution to calculate an approximate P-value for your test. This involves calculating the proportion of values in the null distribution that are as extreme or more extreme than the observed value of the test statistic (calculated from the real data). Whether this calculation must take account of just one or both tails of the null distribution depends on your test statistic. A complete loop, including the P-value calculation, is sketched below.
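Putting the steps together for a two-tailed test using the correlation coefficient, a minimal sketch might look like the following. Here "nperm" and "obs" are names introduced for illustration:
nperm <- 10000                          # number of randomizations (arbitrary choice)
obs <- cor(x, y)                        # observed test statistic from the real data
z <- vector()                           # initialize z (do only once)
for(i in 1:nperm){
    x1 <- sample(x, replace = FALSE)    # resample x without replacement
    z[i] <- cor(x1, y)                  # test statistic for the randomized data
}
mean(abs(z) >= abs(obs))                # approximate two-tailed P-value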