Resampling methods
This page provides tips on how to carry out bootstrapping and randomization tests in R by resampling data.

Bootstrap resampling
The bootstrap is mainly used in estimation. The method uses resampling with replacement to generate an approximate sampling distribution of an estimate. Standard errors and confidence intervals can be calculated using this bootstrap sampling distribution.

Resample data
Let's assume that the data are a sample of measurements for a single variable stored in a vector "x". The data may be numeric or categorical.
1. A single bootstrap replicate is obtained as follows. The replace option is used to indicate that sampling is carried out with replacement.
xboot <- sample(x, replace=TRUE)
2. Calculate the statistic of interest (for example, the mean) on the resampled data in "xboot" and store the result in a vector created for this purpose.
z <- vector() # initialize z (do only once)
z[1] <- mean(xboot) # first bootstrap replicate estimate
3. Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for subsequent analysis. A complete loop is sketched below.
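Putting steps (1) through (3) together, a minimal sketch of the full loop might look like the following. Here "B", the number of replicates, is a name introduced for illustration, and the data are assumed to be in the vector "x".
B <- 2000                                 # number of bootstrap replicates (arbitrary choice)
z <- vector()                             # initialize z (do only once)
for(i in 1:B){
    xboot <- sample(x, replace = TRUE)    # resample the data with replacement
    z[i] <- mean(xboot)                   # store the bootstrap replicate estimate
}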
If the data include two or more variables stored in the rows of a data frame (here called "mydata"), resample whole individuals (i.e., rows) so that measurements from the same individual stay together.
1. A single bootstrap replicate is obtained by resampling row numbers with replacement and using them to index the rows of the data frame.
iboot <- sample(1:nrow(mydata), replace = TRUE)
bootdata <- mydata[iboot,]
The data frame "bootdata" will contain a single bootstrap replicate including all the variables. - Calculate the statistic of interest
on the resampled data and store the result in vector created
for
this purpose. For example, to calculate the correlation between two variables x and y in "bootdata",
z <- vector() # initialize z (do only once)
z[1] <- cor(bootdata$x, bootdata$y)
3. Repeat steps (1) and (2) many times. The result will be a large collection of bootstrap replicate estimates for subsequent analysis. A complete loop is sketched below.
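As before, the three steps can be combined into a single loop. A minimal sketch, assuming the data frame "mydata" contains the variables x and y, and with "B" again introduced as the number of replicates:
B <- 2000                                            # number of bootstrap replicates
z <- vector()                                        # initialize z (do only once)
for(i in 1:B){
    iboot <- sample(1:nrow(mydata), replace = TRUE)  # resample row numbers
    bootdata <- mydata[iboot, ]                      # single bootstrap replicate
    z[i] <- cor(bootdata$x, bootdata$y)              # store the replicate estimate
}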
Bootstrap standard error
Assume that the vector "z" contains a large number of bootstrap replicate estimates of a statistic of interest. The bootstrap standard error of the statistic is obtained as the standard deviation of the bootstrap replicate estimates. (The most common mistake at this stage is to calculate the standard error of the bootstrap replicates rather than the standard deviation.)
sd(z)
Confidence interval approximation using the percentile method
The percentile method is often used to provide an approximate 95% confidence interval for the population parameter. The percentile method is not as accurate as the BCa method, but it is very quick and easy to calculate. Assume that the vector "z" contains a large number of bootstrap replicate estimates of the statistic of interest. By the percentile method, a 95% bootstrap confidence interval for the parameter is obtained as
quantile(z, probs = c(0.025, 0.975))
A large number of bootstrap replicate estimates is required for an accurate confidence interval.
Bootstrap estimate of bias
The bootstrap is often used to determine the bias of an estimate. Assume that the vector "z" contains a large number of bootstrap replicate estimates of a statistic of interest. Also, assume that the same statistic computed on the original data is stored in an object named "estimate". The bootstrap estimate of bias is calculated as
mean(z) - estimate
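The three quantities above can be computed together once the resampling loop has run. A brief sketch, assuming "z" holds bootstrap replicate estimates of the mean and "x" holds the original data:
estimate <- mean(x)                     # statistic computed on the original data
sd(z)                                   # bootstrap standard error
quantile(z, probs = c(0.025, 0.975))    # approximate 95% confidence interval
mean(z) - estimate                      # bootstrap estimate of bias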
Using the boot package
The boot library, included with the standard R installation, provides useful commands for bootstrapping. Notably, it will calculate confidence intervals using the BCa method, which is more accurate than those produced by the percentile method. BCa stands for "bias corrected and accelerated". The method corrects for estimated bias and skewness. Consult Efron and Tibshirani (1998) for details.

To begin, load the boot library
library(boot)
Single variable
To use the boot package you will need to write a function to calculate the statistic of interest. The format is illustrated below for the sample mean, but any univariate function would be handled similarly. We'll call our function "boot.mean". When you have finished writing the script for a function you will need to cut and paste it into the command window so that R can access it (you'll need to do this just once in an R session). Here, "x" refers to the vector of data. "i" refers to a vector of indices, which must be included as an argument in the function and employed as shown.
boot.mean <- function(x, i){
    boot.mean <- mean(x[i])
}
The command "boot" will automatically carry out all the resampling and computations required. For this example, "x" is the vector of original data and "boot.mean" is the name of the function we created above to calculate the statistic of interest. R specifies the number of bootstrap replicate estimates desired.
z <- boot(x, boot.mean, R=2000)
The resulting object (here named "z") is a boot object containing all the results. Use the following additional commands to pull out the results.
print(z) # Bootstrap calculation of bias and SE
sd(z$t) # Another way to get the standard error
hist(z$t) # Histogram of bootstrap replicate estimates
qqnorm(z$t) # Normal quantiles of replicate estimates
boot.ci(z,type="bca") # 95% confidence interval using BCa
boot.ci(z,type="perc") # Same using percentile method
Two or more variables
Here's how you would use the boot command if the statistic of interest must be calculated from two (or more) variables (for example, a correlation, regression slope, or odds ratio). Assume that there are two variables of interest, "x" and "y". If not done already, put the two variables into a data frame (here called "mydata"),
mydata <- cbind.data.frame(x, y, stringsAsFactors=FALSE)
Then create a function to calculate the statistic of interest on the variables in "mydata". For example, to create a function that calculates the correlation coefficient between the two variables "x" and "y", use
boot.cor <- function(mydata, i){
    x <- mydata$x[i]
    y <- mydata$y[i]
    boot.cor <- cor(x, y)
}
Here, "i" refers to a vector of indices, which must be included as an argument in the function and employed as shown.
Finally, pass your data frame and function to the boot command,
z <- boot(mydata, boot.cor, R=2000)
See the previous section for a list of commands to pull results from the boot object (here named "z").
Randomization test
A randomization test uses resampling and the computer to generate a null distribution for a test statistic. The test statistic is a measure of association between two variables or difference between groups. Each randomization step involves randomly resampling without replacement the values of one of the two variables in the data. The two variables may be categorical (character or factor), numeric, or one of each.

Both variables categorical
R has a built-in randomization procedure for a contingency test of association when both of two variables are categorical (call them A1 and A2). To apply it, execute the usual command for the χ2 contingency test, but set the "simulate.p.value" option to TRUE. The number of replicates in the randomization is set by the option "B" (default is 2000). Each randomization rearranges the values in the contingency table while keeping all the row and column totals fixed to their observed values.
chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)
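For example, with two made-up categorical vectors (the data below are invented purely for illustration):
A1 <- rep(c("treated", "control"), c(12, 12))         # hypothetical treatment variable
A2 <- sample(c("lived", "died"), 24, replace = TRUE)  # hypothetical outcome variable
chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)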
Data are numeric
If one or both of the variables is numeric, then you will need to create a short loop to carry out the resampling necessary for the randomization test. Choose one of the two variables to resample (call it "x"). It doesn't matter which of the two variables you choose. Keep the other variable (call it "y") unchanged (there is no benefit to resampling both variables).
1. Resample "x" without replacement to create a new vector (call it x1).
x1 <- sample(x, replace = FALSE)
2. Calculate the test statistic to measure association between y and the randomized variable x1. Store the result in a vector created for this purpose. For example, to calculate the correlation between the two variables,
z <- vector() # initialize z (do only once)
z[1] <- cor(x1, y) # first randomization result
3. Repeat steps (1) and (2) many times. The result will be a large collection of replicates representing the null distribution of your test statistic.
4. Use this null distribution to calculate an approximate P-value for your test. This involves calculating the proportion of values in the null distribution that are as extreme or more extreme than the observed value of the test statistic (calculated from the real data). Whether this calculation must take account of just one or both tails of the null distribution depends on your test statistic. A complete loop, including the P-value calculation, is sketched below.
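Putting the steps together for a two-tailed test using the correlation coefficient, a minimal sketch might look like the following. Here "nperm" and "obs" are names introduced for illustration:
nperm <- 10000                          # number of randomizations (arbitrary choice)
obs <- cor(x, y)                        # observed test statistic from the real data
z <- vector()                           # initialize z (do only once)
for(i in 1:nperm){
    x1 <- sample(x, replace = FALSE)    # resample x without replacement
    z[i] <- cor(x1, y)                  # test statistic for the randomized data
}
mean(abs(z) >= abs(obs))                # approximate two-tailed P-value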