Manage data
This page introduces the basics of working with full data sets, which tend to have multiple variables often of several types. It also explains general methods to apply functions to multiple variables and groups at once.
Read data from text file
Work with a spreadsheet program
It is recommended that you store your data in an ordinary text file created with the help of a spreadsheet program, such as Excel or Calc. A text file will always be readable, even once Microsoft goes under and the Excel format becomes obsolete.
Data layout
Here is a recommended layout for your data in the spreadsheet. (A workaround is always possible in R if the data have an alternative layout, but it can take time.)
- Keep data that you want analyzed together in a single spreadsheet (worksheet).
- Use a "long" layout for your data. Put variables in separate columns, with variable names (date, size, sex, etc) in the top row.
- Use brief, informative variable names in plain text. Keep more detailed explanations of variables in a separate text file, not in the data spreadsheet. If a variable name has more than one word, join the words with a dot (e.g., “size.mm”) to eliminate spaces. Do not include commas or non-ascii characters in variable names.
- Each row should contain the measurements of only a single sample unit (subject, plot, aquarium, etc). Add rows, not columns, as you add data on more units.
- Put repeat measurements taken on the same sampling unit (e.g., measurements taken on multiple dates) in separate rows rather than separate columns. Include a new variable to indicate date or measurement number (first, second, etc).
- Use the same symbol consistently for missing values and you will be able to alert R when you input the data. By default, R will interpret NA in a column of data as a missing value. (Consider using a different abbreviation to indicate North America or North Africa!) R will also treat a blank entry as missing for most data types except characters and factors -- which will treat a blank entry as a character string with zero length -- not the same as missing.
- Avoid putting non-numeric characters in columns of numeric data, because otherwise R will assume that the entire column is non-numeric. For example, avoid using "12.67?" to indicate a number you are not sure about. Instead, enter 12.67 in the column as measured, but also add a new variable to the data set called “is.ok”, using a 0 (for example) to indicate a dubious measurement and a 1 otherwise.
- Avoid using the standard date formats of your spreadsheet program, which can yield ambiguous numbers. Instead, use the international format (YYYY-MM-DD) or use separate columns for year, month and day.
- Keep commas out of your data set entirely, because they will be needed as column delimiters in your .csv file.
- R is case-sensitive: "Hi" and "hi" are distinct entries. If this is not desired, transform the variable later to a uniform case using the casefold command.
Read data into R
To read data into R, save your worksheet as a comma-delimited text file (.csv file). Then read the file using the read.csv command. The command without any of the options is
mydata <- read.csv(file.choose())
Navigate to your file in the resulting pop-up window.
Alternatively, specify the full directory and file name,
mydata <- read.csv("c:\\directoryname\\filename.csv") # PC
mydata <- read.csv("/directoryname/filename.csv") # Mac
The read.csv command has lots of options, but perhaps the most useful are the following.
mydata <- read.csv(file.choose(),
stringsAsFactors = FALSE,
strip.white = TRUE,
na.strings = c("NA", ""))
The "stringsAsFactors = FALSE" tells R to keep character variables as they are rather than convert character data to factors (see below). Factors are a little finicky compared with character data. Character variables can later be converted to factors as the need arises.
The "strip.white = TRUE" option makes sure that blank spaces at the start and end of character data are removed when the data file is read. Often blanks are introduced accidentally during data entry. R treats "word" and " word" differently, which is not usually desired.
Finally, " na.strings = c("NA","") " tells R that in addition to the usual NA, empty strings in columns of character data are also to be treated as missing. By default, R treats a blank cell in a column of character data as a character string of zero length rather than as missing.
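The effect of these options can be seen in a small sketch. It assumes a made-up three-row data set, supplied inline through the "text" argument of read.csv so that no file is needed; the variable names (id, species, size.mm) and the object names "raw" and "clean" are invented for illustration.

```r
# Hypothetical csv contents: one padded entry, one blank cell, one NA
csv_text <- "id,species,size.mm
1, stickleback ,41.2
2,,38.5
3,sculpin,NA"

# Read once with only stringsAsFactors set, once with all three options
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
clean <- read.csv(text = csv_text,
                  stringsAsFactors = FALSE,
                  strip.white = TRUE,
                  na.strings = c("NA", ""))

raw$species           # keeps the padding and a zero-length string
clean$species         # padding stripped, blank cell read as missing
is.na(clean$species)  # flags the blank cell
is.na(clean$size.mm)  # flags the NA entry
```

Comparing "raw" and "clean" side by side is a quick way to check that the options did what you wanted on your own file.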
R automatically calls variable types
As it reads your data, R will classify your variables into types.
- Columns with only numbers are made into numeric or integer variables.
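- Columns with non-numeric characters are made into factors unless you specify that they should remain characters using the "stringsAsFactors = FALSE" option in the read command. (Since R version 4.0.0, "stringsAsFactors = FALSE" is the default, so such columns are read as character variables unless you ask for factors.)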
To check on how R has classified all your variables, enter
str(mydata) # "str" stands for structure
To check on R's classification of just one variable, x,
class(mydata$x) # integer, character, factor, numeric, etc
is.factor(mydata$x) # result: TRUE or FALSE
is.character(mydata$x) # result: TRUE or FALSE
is.integer(mydata$x) # result: TRUE or FALSE
Convert between variable types
You can always convert variables between types. The following should work well:
mydata$x <- as.factor(mydata$x) # character to factor
mydata$x <- as.character(mydata$x) # factor to character
Warning: Converting factors to numeric or integer data can lead to unwanted outcomes. For example, the value "-1.32" is a perfectly good factor level, but converting such a factor level to a numeric value using "as.numeric" would not yield the number -1.32. Instead, the result would be a positive integer. However, converting the character "-1.32" using the command "as.numeric" would yield the number -1.32. (This is one example of how character data are easier to work with than factors.)
Always check the results of a conversion to make sure R did what you wanted.
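The warning above can be illustrated with a short sketch. The values are made up, and the object names "f", "codes" and "values" are hypothetical. Converting the factor to character first and then to numeric is a common way to recover the numbers.

```r
# A factor whose levels look like numbers (made-up values)
f <- factor(c("-1.32", "0.5", "10"))

codes <- as.numeric(f)                 # internal level codes: positive integers
values <- as.numeric(as.character(f))  # the numbers the labels represent

codes
values
```

Which code is assigned to which level depends on how the levels are sorted, so the codes carry no numeric meaning.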
Write/save a data frame to a text file
To write a data frame to a comma delimited text file, use the following commands. Include the option "row.names = FALSE" if you don't want the row indicators of the data frame included in the first column of the csv file.
write.csv(mydata, file="c:\\directoryname\\filename.csv") # PC
write.csv(mydata, file="/directoryname/filename.csv") # Mac
The following modification seems to work better in Windows than on a Mac if a version of the text file does not already exist.
write.csv(mydata, file=file.choose())
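As a rough check that a file was written as intended, you can write it and read it straight back. The sketch below assumes a toy data frame and uses a temporary file (from "tempfile") in place of a real directory and file name; "outfile" and "check" are hypothetical names.

```r
# Toy data frame standing in for your own data
mydata <- data.frame(id = 1:3, size.mm = c(41.2, 38.5, NA))

# Temporary path used here instead of a real directory and file name
outfile <- tempfile(fileext = ".csv")

# Omit the row indicators so the file contains only the variables
write.csv(mydata, file = outfile, row.names = FALSE)

readLines(outfile)           # the raw text: a header line, then one line per row
check <- read.csv(outfile)   # read it back in
all.equal(mydata, check)     # TRUE if nothing changed on the round trip
```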
Manage data frames
View and edit the data frame
The following commands are useful for viewing aspects of a data frame.
head(mydata) # prints the first few rows
names(mydata) # see just the variable names
str(mydata) # check the variable types
rownames(mydata) # view row names (gives row numbers if
# you haven't assigned row names)
To view (and make edits to) the entire data frame, enter
fix(mydata)
This will open up a window in which you can scroll through the data. You will be able to make changes to individual cells or change variable names. You can’t change data types in this window.
Warning: Any changes made to the data frame in this way go unrecorded. It is far better to make edits using the command window so that you keep a record of them in your script file. For example, rather than change the value in the third row, second column to "5" using "fix", enter the following in the command window:
mydata[3,2] <- c(5)
Transform a variable and save the result in the data frame
For example, to log-transform the variable "size.mm" and store it as a new variable, "logsize", in the same data frame,
mydata$logsize <- log(mydata$size.mm) # natural log
Delete a variable from a data frame
To delete the variable "x" from the data frame,
mydata$x <- NULL # NULL must be in upper case letters
Relate variables in two data frames
Often, measurements stored in two data frames relate to one another. One data frame might contain measurements of all captured individuals of a bird species (e.g., weight, age, sex), including the study site in which the individual was captured. A second data frame might contain physical measurements made on those study sites (e.g., elevation, rainfall). If the site names or numbers in both data frames correspond, then it is possible to bring measurements from one frame to the other using the "match" command.
For example, to bring the site variable "elevation" from the "sites" data frame to the "birds" data frame,
birds$elevation <-
sites$elevation[match(birds$siteno, sites$siteno)]
Always check the results to make sure R did what you wanted.
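One way to carry out that check is sketched below with two made-up data frames. The values, and the helper object "idx", are invented for illustration. A bird whose site number does not appear in "sites" gets NA for elevation, so looking for NA in the match result reveals any unmatched rows.

```r
# Hypothetical site and bird data; site 4 is deliberately absent from "sites"
sites <- data.frame(siteno = c(1, 2, 3),
                    elevation = c(250, 900, 1400))
birds <- data.frame(id = 1:5,
                    siteno = c(2, 2, 3, 1, 4),
                    weight = c(18.1, 17.6, 19.0, 18.4, 17.9))

# Position of each bird's site within the "sites" data frame (NA if not found)
idx <- match(birds$siteno, sites$siteno)
birds$elevation <- sites$elevation[idx]

birds                  # elevation has been added as a new variable
birds[is.na(idx), ]    # birds whose site was not found in "sites"
```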
Attach a data frame
It can be cumbersome to type the name of the data frame over and over to carry out operations and functions on the variables within, such as
plot(mydata$length.mm, mydata$mass.g)
It would be great if R somehow knew that the variables to plot were in the data frame "mydata". Two functions can help: "attach" and "with".
Attach
Warning: When you attach a data frame, you are only attaching an unchangeable copy. Subsequent changes to variables in the real data frame located in your workspace, and addition of new variables, do not update the attached copy. To update the attached copy after making changes to the data frame you will need to detach and then attach again.
If you are not planning to make changes to the data frame of interest, enter
attach(mydata)
to place it in R's search path. Thereafter, when you type a variable name such as "length.mm" or "mass.g", the attached data frame "mydata" is the second place R will look for it. (R will first look in the workspace, the usual place where objects are located, so any object there with the same name as your variable will have precedence.) So you can simply type, for example,
plot(length.mm, mass.g)
When you are done, enter
detach()
With
Use "with" to force R to look first in your data frame in the workspace for any variables,
with(mydata, plot(length.mm, mass.g))
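The difference between the two approaches shows up as soon as the data frame changes. The following sketch assumes a toy data frame with a single variable; "stale" and "current" are hypothetical names used only to hold the two results.

```r
# Toy data frame (made-up values)
mydata <- data.frame(length.mm = c(10, 12, 15))

attach(mydata)                             # attaches a copy to the search path
mydata$length.mm <- mydata$length.mm * 2   # change the real data frame

stale <- length.mm                  # found in the attached copy: old values
current <- with(mydata, length.mm)  # found in the data frame itself: new values

detach(mydata)                      # remove the attached copy when done

stale
current
```

Because "with" always looks in the current version of the data frame, it avoids the need to detach and re-attach after every change.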
Sort and order rows of a data frame
It is often convenient to re-order the rows of a data frame "mydata" to correspond to the sorted order of one of its variables, for example the variable "x". To do this use the "order" function as follows,
mydata.x <- mydata[order(mydata$x), ]
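A few variations on the same idea are sketched here with a made-up data frame. The names "mydata.desc" and "mydata.Ax" are hypothetical. By default "order" sorts in ascending order and places missing values last.

```r
# Toy data frame with one missing value (made-up values)
mydata <- data.frame(x = c(3, 1, NA, 2),
                     A = c("b", "a", "a", "b"))

mydata.x <- mydata[order(mydata$x), ]                        # ascending x, NA last
mydata.desc <- mydata[order(mydata$x, decreasing = TRUE), ]  # descending x
mydata.Ax <- mydata[order(mydata$A, mydata$x), ]             # by A, then by x within A

mydata.x
mydata.desc
mydata.Ax
```

The row names of the sorted data frames still show the original row numbers, which makes it easy to see how the rows were moved.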
Apply the same function to several variables at once
R is outstanding at carrying out calculations on many variables at once.
Apply an operation to all rows or to all columns
The apply command will carry out an operation on all rows or columns of a data frame or matrix.
For example, to obtain the row sums of a data frame, summing across variables,
apply(mydata, 1, sum, na.rm=TRUE) # the "1" refers to rows
Obtaining the column sums instead is similar,
apply(mydata, 2, sum, na.rm=TRUE) # the "2" refers to columns
Other vector functions will also work in place of sum.
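Here is a minimal sketch with a made-up, all-numeric data frame; the object names "row.sums", "col.sums" and "col.max" are hypothetical. The example assumes every column is numeric, since a character column would make "sum" fail.

```r
# Toy all-numeric data frame with missing values (made-up values)
mydata <- data.frame(a = c(1, 2, NA),
                     b = c(4, 5, 6),
                     c = c(7, NA, 9))

row.sums <- apply(mydata, 1, sum, na.rm = TRUE)  # one sum per row
col.sums <- apply(mydata, 2, sum, na.rm = TRUE)  # one sum per column
col.max <- apply(mydata, 2, max, na.rm = TRUE)   # another function in place of sum

row.sums
col.sums
col.max
```

For sums and means specifically, R also provides the shortcuts rowSums, colSums, rowMeans and colMeans, which give the same results.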
Apply a function to several variables in a data frame
The sapply and lapply commands allow you to apply any function to many variables in a data frame at once. They differ in the format of their output.
For example, here's how to calculate the mean of the variables in columns 2 through 5 in a data frame named "mydata". A key feature of this first example is that the function of interest (the mean) yields just a single number for each variable.
myresult <- sapply(mydata[,2:5], FUN = mean)
mylist <- lapply(mydata[,2:5], FUN = mean)
The only difference between the two commands used above is that sapply returns a vector whereas lapply returns a list.
To pass options to the function, include them after FUN. For example, to tell the function "mean" to exclude missing values,
mylist <- lapply(mydata[,2:5], FUN = mean, na.rm = TRUE)
You can, in the same way, use functions whose output is a vector of numbers for each variable (rather than a single number, like the mean). An example is the function "summary", which calculates several descriptive statistics on a given variable.
myresult <- sapply(mydata[,2:5], FUN=summary) # returns a matrix
mylist <- lapply(mydata[,2:5], FUN=summary) # returns a list
It is usually convenient to convert a matrix to a data frame if you want to save it or carry out further analyses on it.
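One possible way to do the conversion is sketched below, assuming a toy data frame with no missing values; "mytable" is a hypothetical name. The matrix is transposed first so that each variable gets its own row. Note that if only some variables contain missing values, "summary" returns results of different lengths and sapply may return a list rather than a matrix.

```r
# Toy data frame: an id column plus two numeric variables (made-up values)
mydata <- data.frame(id = 1:4,
                     a = c(1, 2, 3, 4),
                     b = c(10, 20, 30, 40))

# Matrix with statistics in rows and variables in columns
myresult <- sapply(mydata[, 2:3], FUN = summary)

# Transpose, then convert: one row per variable, one column per statistic
mytable <- as.data.frame(t(myresult))
mytable$variable <- rownames(mytable)   # keep the variable names as a column

myresult
mytable
```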
Use a loop
Alternatively, you can analyze variables one at a time using a loop. The example below calculates the standard deviation of each of eight variables in the data frame "mydata". Let's assume that the variables of interest are located in the 2nd, 6th, and 10th through 15th columns of mydata. Here, "i" is the counter.
for(i in c(2,6,10:15)){
print( sd(mydata[,i], na.rm=TRUE) )
}
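If you want to keep the results rather than just print them, one option is to create a vector beforehand and fill it inside the loop. This is a sketch on a small made-up data frame; the names "cols" and "result" are hypothetical, and only two columns are used to keep the example short.

```r
# Toy data frame (made-up values)
mydata <- data.frame(id = 1:4,
                     a = c(1, 2, 3, 4),
                     b = c(2, 4, 6, NA))

cols <- c(2, 3)                       # columns holding the variables of interest
result <- numeric(length(cols))       # empty vector to hold the results
names(result) <- names(mydata)[cols]  # label each result with its variable name

for (i in seq_along(cols)) {
  result[i] <- sd(mydata[, cols[i]], na.rm = TRUE)
}

result
```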
Apply a function to several groups at once
The following commands can save time when you want to carry out an analysis by groups. The groups are indicated by a categorical variable (month, continent, etc) that is a factor or a character variable. In the examples below, "x" is a numeric variable and "A" is a categorical variable (a factor or character variable) identifying multiple groups.
Use tapply for one group variable
"tapply" is one of the most useful commands you'll ever need. It allows you to apply a function to a single variable "x" separately for each of two or more groups identified by a categorical variable (call it "A"). For example, to calculate the median of "x" separately for each group identified in the variable "A",
tapply(x, INDEX = A, FUN = median, na.rm=TRUE)
In this example, "na.rm" is not a "tapply" option -- rather it is an option for the function "median", telling it to remove missing values before computing. To pass optional arguments to FUN, as in this example, include them immediately after FUN.
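A small worked sketch, using made-up values for "x" and two invented groups in "A", shows the form of the output; "meds" is a hypothetical name for the result.

```r
# Made-up measurements and group labels
x <- c(5, 7, NA, 2, 4, 9)
A <- c("north", "north", "north", "south", "south", "south")

# Median of x within each group, ignoring the missing value
meds <- tapply(x, INDEX = A, FUN = median, na.rm = TRUE)

meds            # one median per group, labelled by group name
meds["north"]   # extract the result for a single group
```

If "x" and "A" are variables in a data frame, the same command can be wrapped in "with", as in with(mydata, tapply(x, INDEX = A, FUN = median, na.rm = TRUE)).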