This page explains how to get started doing repetitive calculations, such as applying a function to many columns or rows of a data frame, or applying the same function to multiple groups.
for loops
The for loop is probably the easiest way to repeat something over and over.
The following simple example repeats the same command 5 times. The variable i is a counter that starts at 1 and increases by 1 each time the commands between the braces “{ }” are executed.
for(i in 1:5){
  print("Back off, man, I'm a scientist")
}
This next example uses i to index a different element of a vector on each iteration of the loop. The loop below prints the elements of a vector x, one element on each iteration.
for(i in seq_along(x)){ # seq_along(x) is 1, 2, ..., length(x), and is safe if x is empty
  print(x[i]) # use "print" to print to screen from inside loops
}
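The loop above needs a vector x to already exist. Here is a minimal sketch that can be run as is, assuming a made-up vector x (the values are purely illustrative). It also shows why seq_along(x) is a safe way to generate the indices: when the vector is empty, the loop simply does not run.

```r
# Hypothetical example vector; any vector would work here
x <- c(2.5, 3.1, 4.7)

# seq_along(x) gives the indices 1, 2, ..., length(x)
for(i in seq_along(x)){
  print(x[i])
}

# For an empty vector, 1:length(empty) would count 1, 0,
# whereas seq_along(empty) has length zero and the loop is skipped
empty <- numeric(0)
for(i in seq_along(empty)){
  print(empty[i])  # never reached
}
```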
Typically, you want to automate a repetitive task and save the results. For example, you might want to generate 100 random samples of size n = 10 and calculate (and save) the mean each time. The following loop will accomplish this. To save the results in a new vector named myMeans, create it before starting the loop. Inside the loop, save the result from each iteration i into the i’th element of the vector.
myMeans <- vector("numeric", length = 100)
for(i in 1:100){
  x <- runif(10) # generate 10 random numbers from a uniform distribution
  myMeans[i] <- mean(x, na.rm = TRUE)
}
print(myMeans) # see the results!
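Because the samples are random, the results differ from run to run. The sketch below is one possible way to make the simulation repeatable with set.seed(), and it checks the loop against replicate(), a base R shortcut for repeating an expression. The seed value 42 and the name myMeans2 are arbitrary choices made for this illustration.

```r
set.seed(42)  # arbitrary seed, chosen only so the run can be repeated
myMeans <- vector("numeric", length = 100)
for(i in 1:100){
  x <- runif(10)          # 10 random numbers from a uniform distribution
  myMeans[i] <- mean(x)   # store the mean of sample i
}

set.seed(42)  # reset the seed so replicate() draws the same random numbers
myMeans2 <- replicate(100, mean(runif(10)))

all.equal(myMeans, myMeans2)  # compares the two sets of means
summary(myMeans)              # quick look at the distribution of the means
```

The loop makes each step explicit, which helps when more than one quantity must be saved per iteration; replicate() is more compact when a single value is computed each time.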
As a final example, you might have a collection of variables (columns of a data frame mydata) and want to calculate the sample mean for each variable. The loop below uses i to index each variable (column) in turn. To save the results in a new vector named result, create it before starting the loop. Inside the loop, save the result from each iteration i into the i’th element of the vector.
result <- vector("numeric", length = ncol(mydata)) # initialize vector to store results
for(i in 1:ncol(mydata)){
  result[i] <- mean(mydata[ ,i], na.rm = TRUE) # mean of ith variable, and store in result
}
result # see the results!
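To try the column loop, a data frame is needed. The following sketch assumes a small made-up data frame (the column names height and mass and all values are hypothetical) and labels each stored mean with its column name. The base R function colMeans() is used only as a cross-check.

```r
# Hypothetical data frame with one missing value
mydata <- data.frame(
  height = c(1.2, 3.4, NA, 2.8),
  mass   = c(10, 12, 15, 11)
)

result <- vector("numeric", length = ncol(mydata))  # initialize storage
for(i in seq_len(ncol(mydata))){
  result[i] <- mean(mydata[ , i], na.rm = TRUE)     # mean of ith column
}
names(result) <- names(mydata)  # label each mean with its variable name
result

# Cross-check against the built-in column means
all.equal(result, colMeans(mydata, na.rm = TRUE))
```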
apply
Use the apply command to repeat a function on multiple columns of a data frame. The code is more compact than an equivalent for loop, although it is not necessarily faster.
MARGIN = 2 in the following example indicates columns. FUN indicates the function to use on each column. Arguments to FUN go last (in this example, na.rm = TRUE is an argument to the mean function). The output, here stored in result, is a vector containing the variable means, one for every column in mydata.
result <- apply(mydata, MARGIN = 2, FUN = mean, na.rm = TRUE)
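One point to watch is that apply converts the data frame to a matrix before doing anything else, so a single non-numeric column turns every value into text. A minimal sketch, assuming a made-up data frame with one character column (id) and a helper vector named isNumeric, both invented for this illustration, selects the numeric columns first:

```r
# Hypothetical data frame: one character column, two numeric columns
mydata <- data.frame(
  id = c("a", "b", "c"),
  x  = c(1, 2, NA),
  y  = c(4, 5, 6)
)

# TRUE for numeric columns, FALSE otherwise
isNumeric <- sapply(mydata, is.numeric)

# Apply mean() to the numeric columns only
result <- apply(mydata[ , isNumeric], MARGIN = 2, FUN = mean, na.rm = TRUE)
result
```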
User-defined functions can be used in the same way. The following calculates the standard error of each column (variable) of a data frame mydata.
se <- function(x){ # x is a dummy variable for the function
  s <- sd(x, na.rm = TRUE) # calculate the standard deviation
  n <- length(x[!is.na(x)]) # calculate the sample size
  se <- s/sqrt(n) # standard error
  se # what the function will return
}
result <- apply(mydata, 2, FUN = se)
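As a quick check of the standard error function, the sketch below applies it to a small made-up data frame (values are hypothetical) where the answers are easy to confirm by hand. The function is restated so that the example runs on its own; sum(!is.na(x)) is used here as an equivalent way to count the non-missing values.

```r
se <- function(x){
  s <- sd(x, na.rm = TRUE)   # standard deviation, ignoring missing values
  n <- sum(!is.na(x))        # number of non-missing observations
  s / sqrt(n)                # standard error, returned by the function
}

# Hypothetical data: column x has one missing value
mydata <- data.frame(
  x = c(1, 2, 3, NA),
  y = c(2, 4, 6, 8)
)

result <- apply(mydata, 2, FUN = se)
result

# By hand: for x, sd = 1 and n = 3, so the standard error is 1/sqrt(3)
1 / sqrt(3)
```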
apply
The apply command is also used to repeat a function on multiple rows of a data frame.
The command is similar to that used above on columns, except that MARGIN = 1 is used to indicate rows. The output, here stored in result, is a vector containing the means, one for each of the rows in mydata.
result <- apply(mydata, MARGIN = 1, FUN = mean, na.rm = TRUE)
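A small worked case may help confirm what the row-wise calculation returns. The sketch below assumes a made-up numeric data frame (values are hypothetical) and uses the base R function rowMeans() only as a cross-check.

```r
# Hypothetical data frame; the third row has a missing value in x
mydata <- data.frame(
  x = c(1, 2, NA),
  y = c(3, 4, 5),
  z = c(5, 6, 7)
)

# One mean per row, ignoring missing values
result <- apply(mydata, MARGIN = 1, FUN = mean, na.rm = TRUE)
result

# Cross-check against the built-in row means (names are dropped for comparison)
all.equal(unname(result), unname(rowMeans(mydata, na.rm = TRUE)))
```

The third row mean is based on the two available values only, because na.rm = TRUE drops the missing one.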
tapply
Use tapply to analyze a vector (variable) separately by groups. For example, to calculate the median of x separately for each group identified by the variable A,
result <- tapply(mydata$x, INDEX = mydata$A, FUN = median, na.rm=TRUE)
Function options go last: in this example, na.rm = TRUE is an option of the function median.
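To see the shape of the output, here is a minimal sketch assuming a made-up data frame with a grouping variable A and a numeric variable x (all values are hypothetical). The result has one element per group, named by the group label.

```r
# Hypothetical data: two groups, one missing value in group "b"
mydata <- data.frame(
  A = c("a", "a", "b", "b", "b"),
  x = c(1, 3, 2, NA, 10)
)

# Median of x within each level of A, ignoring missing values
result <- tapply(mydata$x, INDEX = mydata$A, FUN = median, na.rm = TRUE)
result

# Individual group results can be extracted by name
result[["a"]]
result[["b"]]
```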
aggregate
Use aggregate instead of tapply to analyze multiple variables at once by group. The method lets you identify groups using more than one categorical variable, if needed (e.g., month and year).
For example, to calculate the median of the 2nd through 5th variables in mydata, separately for each group identified by the two categorical variables A and B, use
result <- aggregate(mydata[ ,2:5], by = list(mydata$A, mydata$B),
                    FUN = median, na.rm = TRUE)
The group variable or variables must be enclosed in list(), even if you are using only one grouping variable.
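The output of aggregate is a data frame with one row per group combination. The sketch below assumes a made-up data frame with two grouping variables (A, B) and two numeric variables (x, y); all names and values are hypothetical. Naming the elements of the list, as in list(A = ..., B = ...), labels the group columns in the output; unnamed elements would appear as Group.1 and Group.2.

```r
# Hypothetical data: two grouping variables and two numeric variables
mydata <- data.frame(
  A = c("a", "a", "b", "b", "b", "a"),
  B = c("u", "u", "u", "v", "v", "v"),
  x = c(1, 3, 5, 2, 4, 7),
  y = c(10, NA, 30, 20, 40, 70)
)

# Median of x and y for every combination of A and B present in the data
result <- aggregate(mydata[ , c("x", "y")],
                    by = list(A = mydata$A, B = mydata$B),
                    FUN = median, na.rm = TRUE)
result

# Extract the medians for one group combination
result[result$A == "a" & result$B == "u", ]
```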
© 2009-2024 Dolph Schluter