Manage data

This page introduces the basics of working with full data sets, which tend to have multiple variables often of several types. It also explains general methods to apply functions to multiple variables and groups at once.

Read data from text file

Work with a spreadsheet program

It is recommended that you store your data in an ordinary text file created with the help of a spreadsheet program, such as Excel or Calc. A text file will always be readable, even once Microsoft goes under and the Excel format becomes obsolete.

Data layout

Here is a recommended layout for your data in the spreadsheet. (A workaround is always possible in R if the data have an alternative layout, but it can take time.)

Keep data that you want analyzed together in a single spreadsheet (worksheet).
Use a "long" layout for your data. Put variables in separate columns, with variable names (date, size, sex, etc) in the top row.
Use brief, informative variable names in plain text. Keep more detailed explanations of variables in a separate text file, not in the data spreadsheet. If a variable name has more than one word, join the words with a dot (e.g., “size.mm”) to eliminate spaces. Do not include commas or non-ASCII characters in variable names.
Each row should contain the measurements of only a single sample unit (subject, plot, aquarium, etc). Add rows, not columns, as you add data on more units.
Put repeat measurements taken on the same sampling unit (e.g., measurements taken on multiple dates) in separate rows rather than separate columns. Include a new variable to indicate date or measurement number (first, second, etc).
Use the same symbol consistently for missing values and you will be able to alert R when you input the data. By default, R will interpret NA in a column of data as a missing value. (Consider using a different abbreviation to indicate North America or North Africa!) R will also treat a blank entry as missing for most data types. The exceptions are character and factor columns, where a blank entry is read as a character string of zero length, which is not the same as missing.
Avoid putting non-numeric characters in columns of numeric data, because otherwise R will assume that the entire column is non-numeric. For example, avoid using "12.67?" to indicate a number you are not sure about. Instead, enter 12.67 in the column as measured, but also add a new variable to the data set called “is.ok”, using a 0 (for example) to indicate a dubious measurement and a 1 otherwise.
Avoid using the standard date formats of your spreadsheet program, which can yield ambiguous numbers. Instead, use the international format (YYYY-MM-DD) or use separate columns for year, month and day.
Keep commas out of your data set entirely, because they will be needed as column delimiters in your .csv file.
R is case-sensitive: "Hi" and "hi" are distinct entries. If this is not desired, transform the variable later to a uniform case using the casefold command.
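To illustrate the last point, here is a minimal sketch using a made-up character variable "sex" whose entries differ only in case. The variable names are hypothetical.

```r
# Hypothetical entries that differ only in case
sex <- c("Male", "male", "FEMALE", "female")

unique(sex)                                # four distinct values

sex.lower <- casefold(sex, upper = FALSE)  # convert everything to lower case

unique(sex.lower)                          # two distinct values
```

Use "upper = TRUE" instead if you prefer upper case.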

Read data into R

To read data into R, save your worksheet as a comma-delimited text file (.csv file). Then read the file using the read.csv command. The command without any of the options is

mydata <- read.csv(file.choose())

Navigate to your file in the resulting pop-up window.

Alternatively, specify the full directory and file name,

mydata <- read.csv("c:\\directoryname\\filename.csv") # PC

mydata <- read.csv("/directoryname/filename.csv") # Mac

The read.csv command has lots of options, but perhaps the most useful are the following.

mydata <- read.csv(file.choose(),

stringsAsFactors = FALSE,

strip.white = TRUE,

na.strings = c("NA","") )

The "stringsAsFactors = FALSE" tells R to keep character variables as they are rather than convert character data to factors (see below). Factors are a little finicky compared with character data. Character variables can later be converted to factors as the need arises.
The "strip.white = TRUE" option makes sure that blank spaces at the start and end of character data are removed when the data file is read. Often blanks are introduced accidentally during data entry. R treats "word" and " word" differently, which is not usually desired.
Finally, " na.strings = c("NA","") " tells R that in addition to the usual NA, empty strings in columns of character data are also to be treated as missing. By default, R treats a blank cell in a column of character data as a character string of zero length rather than as missing.
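The effect of these options can be checked on a small example. The sketch below uses the "text" argument of read.csv to supply made-up file contents directly, so that no file is needed. The data and variable names are hypothetical.

```r
# Hypothetical file contents: note the stray blanks around the species
# names and the empty species cell in the second data row
csvtext <- "id,species,size.mm
1, stickleback,41.2
2,,38.5
3,sculpin ,NA"

mydata <- read.csv(text = csvtext,
                   stringsAsFactors = FALSE,
                   strip.white = TRUE,
                   na.strings = c("NA", ""))

mydata$species         # "stickleback" NA "sculpin"

is.na(mydata$size.mm)  # FALSE FALSE TRUE
```

The blanks around the species names are gone, and the empty cell is read as missing rather than as a zero-length string.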

R automatically calls variable types

As it reads your data, R will classify your variables into types.

Columns with only numbers are made into numeric or integer variables.
Columns with non-numeric characters are made into factors in versions of R before 4.0.0, unless you specify that they should remain characters using the "stringsAsFactors = FALSE" option in the read command. From R 4.0.0 onward, such columns remain characters by default.

To explain, a factor is a categorical variable whose unique values represent levels. These levels are named, like characters, but the levels additionally have a numerical interpretation. You will eventually need to convert character variables to factors if you plan to fit linear models, but there's little rush.
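A short sketch may make the distinction concrete. It uses a made-up character variable "sex" and shows both the named levels and the integer codes that R stores behind them.

```r
sex <- c("male", "female", "female")   # hypothetical character variable

f <- as.factor(sex)

levels(f)       # "female" "male"   (levels are sorted alphabetically)

as.integer(f)   # 2 1 1             (integer codes behind the levels)
```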

To check on how R has classified all your variables, enter

str(mydata) # "str" stands for structure

To check on R's classification of just one variable, x,

class(mydata$x) # integer, character, factor, numeric, etc

is.factor(mydata$x) # result: TRUE or FALSE

is.character(mydata$x) # result: TRUE or FALSE

is.integer(mydata$x) # result: TRUE or FALSE

Convert between variable types

You can always convert variables between types. The following should work well:

mydata$x <- as.factor(mydata$x) # character to factor

mydata$x <- as.character(mydata$x) # factor to character

Warning: Converting factors to numeric or integer data can lead to unwanted outcomes. For example, the value "-1.32" is a perfectly good factor level, but converting such a factor level to a numeric value using "as.numeric" would not yield the number -1.32. Instead, the result would be a positive integer. However, converting the character "-1.32" using the command "as.numeric" would yield the number -1.32. (This is one example of how character data are easier to work with than factors.)

Always check the results of a conversion to make sure R did what you wanted.
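The warning above can be checked with a small made-up factor. A common remedy, sketched here, is to convert the factor to character first and only then to numeric.

```r
# Hypothetical factor whose levels happen to look like numbers
f <- factor(c("-1.32", "0.5", "2.1"))

as.numeric(f)                 # positive integer codes, not the numbers

as.numeric(as.character(f))   # -1.32  0.50  2.10
```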

Write/save a data frame to a text file

To write a data frame to a comma delimited text file, use the following commands. Include the option "row.names = FALSE" if you don't want the row indicators of the data frame included in the first column of the csv file.

write.csv(mydata, file="c:\\directoryname\\filename.csv") # PC

write.csv(mydata, file="/directoryname/filename.csv") # Mac

The following modification seems to work better in Windows than on a Mac if a version of the text file does not already exist.

write.csv(mydata, file=file.choose())
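To see what "row.names = FALSE" does, it can help to write a small data frame and read it back. The sketch below uses a made-up data frame and a temporary file created with "tempfile", purely for illustration.

```r
mydata <- data.frame(id = 1:3, size.mm = c(41.2, 38.5, NA))  # hypothetical data

outfile <- tempfile(fileext = ".csv")   # temporary file, for illustration only

write.csv(mydata, file = outfile, row.names = FALSE)

check <- read.csv(outfile)

names(check)   # "id" "size.mm"
```

Without "row.names = FALSE", the file would contain an extra first column holding the row indicators.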

Manage data frames

View and edit the data frame

The following commands are useful for viewing aspects of a data frame.

head(mydata) # prints the first few rows

names(mydata) # see just the variable names

str(mydata) # check the variable types

rownames(mydata) # view row names (gives row numbers if
# you haven't assigned rows names)

To view (and make edits to) the entire data frame, enter

fix(mydata)

This will open up a window in which you can scroll through the data. You will be able to make changes to individual cells or change variable names. You can’t change data types in this window.

Warning: Any changes made to the data frame in this way go unrecorded. It is far better to make edits using the command window so that you keep a record of them in your script file. For example, rather than change the value in the third row, second column to "5" using "fix", enter the following in the command window:

mydata[3,2] <- c(5)

Transform a variable and save the result in the data frame

For example, to log-transform the variable "size.mm" and store it as a new variable, "logsize", in the same data frame,

mydata$logsize <- log(mydata$size.mm) # natural log

Delete a variable from a data frame

To delete the variable "x" from the data frame

mydata$x <- NULL # NULL must be in upper case letters

Relate variables in two data frames

Often, measurements stored in two data frames relate to one another. One data frame might contain measurements of all captured individuals of a bird species (e.g., weight, age, sex), including the study site in which the individual was captured. A second data frame might contain physical measurements made on those study sites (e.g., elevation, rainfall). If the site names or numbers in both data frames correspond, then it is possible to bring measurements from one frame to the other using the "match" command.

For example, to bring the site variable "elevation" from the "sites" data frame to the "birds" data frame

birds$elevation <-

sites$elevation[match(birds$siteno, sites$siteno)]

Always check the results to make sure R did what you wanted.
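Here is a minimal sketch of such a check, using two small made-up data frames. One bird was captured at a site that has no entry in "sites", which shows what "match" does when it finds no correspondence.

```r
# Hypothetical site and bird data
sites <- data.frame(siteno = c(1, 2, 3),
                    elevation = c(120, 450, 980))

birds <- data.frame(id = 1:5,
                    siteno = c(2, 2, 1, 3, 4))  # site 4 is not in "sites"

birds$elevation <- sites$elevation[match(birds$siteno, sites$siteno)]

birds$elevation   # 450 450 120 980 NA
```

The bird from the unmatched site receives NA for elevation, so counting the missing values in the new variable is a quick way to spot mismatched site names or numbers.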

Attach a data frame

It can be cumbersome to type the name of the data frame over and over to carry out operations and functions on the variables within, such as

plot(mydata$length.mm, mydata$mass.g)

It would be great if R somehow knew that the variables to plot were in the data frame "mydata". Two functions can help: "attach" and "with".

Attach

Warning: When you attach a data frame, you are only attaching an unchangeable copy. Subsequent changes to variables in the real data frame located in your workspace, and addition of new variables, do not update the attached copy. To update the attached copy after making changes to the data frame you will need to detach and then attach again.

If you are not planning to make changes to the data frame of interest, enter

attach(mydata)

to place it in R's search path. Thereafter, when you type a variable name such as "length.mm" or "mass.g", the attached data frame "mydata" is the second place R will look for it. (R will first look in the workspace, the usual place where objects are located, so any object there with the same name as your variable will have precedence.) So you can simply type, for example,

plot(length.mm, mass.g)

When you are done, enter

detach()
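The warning about the attached copy can be seen in a small sketch. It assumes a made-up data frame and that no other object named "length.mm" exists in your workspace.

```r
mydata <- data.frame(length.mm = c(40, 50, 60))   # hypothetical data

attach(mydata)

mydata$length.mm <- mydata$length.mm * 2   # change the real data frame

attached.value <- length.mm   # R finds the attached copy

attached.value                # 40 50 60, not the doubled values

detach(mydata)
```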

With

Use "with" to force R to look first in your data frame in the workspace for any variables,

with(mydata, plot(length.mm, mass.g))
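"with" is not limited to plots. As a sketch, using a made-up data frame, any expression involving the variables can go in the second argument, and the result can be saved as usual.

```r
# Hypothetical data
mydata <- data.frame(length.mm = c(40, 50, 60), mass.g = c(1.0, 2.0, 3.5))

ratio <- with(mydata, mass.g / length.mm)   # one value per row

with(mydata, cor(length.mm, mass.g))        # correlation between the two
```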

Sort and order rows of a data frame

It is often convenient to re-order the rows of a data frame "mydata" to correspond to the sorted order of one of its variables, for example the variable "x". To do this use the "order" function as follows,

mydata.x <- mydata[order(mydata$x), ]
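A few variations on the same idea are sketched below with a made-up data frame: sorting in decreasing order, and sorting by a second variable within the first. By default "order" puts rows with missing values last.

```r
# Hypothetical data
mydata <- data.frame(x = c(3, 1, NA, 2), A = c("b", "a", "a", "b"))

mydata.x <- mydata[order(mydata$x), ]                        # ascending, NA last

mydata.desc <- mydata[order(mydata$x, decreasing = TRUE), ]  # descending

mydata.Ax <- mydata[order(mydata$A, mydata$x), ]             # by A, then x within A

mydata.x$x   # 1 2 3 NA
```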

Apply the same function to several variables at once

R is outstanding at carrying out calculations on many variables at once.

Apply an operation to all rows or to all columns

The apply command will carry out an operation on all rows or columns of a data frame or matrix.

For example, to obtain the row sums of a data frame, summing across variables,

apply(mydata, 1, sum, na.rm=TRUE) # the "1" refers to rows

Obtaining the column sums instead is similar

apply(mydata, 2, sum, na.rm=TRUE) # the "2" refers to columns

Other vector functions will also work in place of sum.

Apply a function to several variables in a data frame

The sapply and lapply commands allow you to apply any function to many variables in a data frame at once. They differ in the format of their output.

For example, here's how to calculate the mean of the variables in columns 2 through 5 in a data frame named "mydata". A key feature of this first example is that the function of interest (the mean) yields just a single number for each variable.

myresult <- sapply(mydata[,2:5], FUN = mean)

mylist <- lapply(mydata[,2:5], FUN = mean)

The only difference between the two commands used above is that sapply returns a vector whereas lapply returns a list.

To pass options to the function, include them after FUN. For example, to tell the function "mean" to exclude missing values,

mylist <- lapply(mydata[,2:5], FUN = mean, na.rm = TRUE)

You can, in the same way, use functions whose output is a vector of numbers for each variable (rather than a single number, like the mean). An example is the function "summary", which calculates several descriptive statistics on a given variable.

myresult <- sapply(mydata[,2:5], FUN=summary) # returns a matrix

mylist <- lapply(mydata[,2:5], FUN=summary) # returns a list

It is usually convenient to convert a matrix to a data frame if you want to save it or carry out further analyses on it.
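A minimal sketch of that conversion, using a made-up data frame with no missing values, is:

```r
# Hypothetical data
mydata <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

myresult <- sapply(mydata[, 2:3], FUN = summary)  # matrix, one column per variable

myresult.df <- as.data.frame(myresult)            # convert to a data frame

myresult.df$a                                     # the statistics for variable "a"
```

The row names of the new data frame hold the names of the statistics.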

Use a loop

Alternatively, you can analyze variables one at a time using a loop. The example below calculates the standard deviation of each of eight variables in the data frame "mydata". Let's assume that the variables of interest are located in the 2nd, 6th, and 10th through 15th columns of mydata. Here, "i" is the counter.

for(i in c(2,6,10:15)){

print( sd(mydata[,i], na.rm=TRUE) )
}
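If you want to keep the results rather than just print them, one option is to create a vector beforehand and fill it inside the loop. This is a sketch using a small made-up data frame. The names "cols" and "result" are hypothetical.

```r
# Hypothetical data
mydata <- data.frame(id = 1:4, a = c(1, 2, 3, 4), b = c(2, 4, 6, NA))

cols <- c(2, 3)                        # columns of interest

result <- numeric(length(cols))        # empty vector to hold the results

names(result) <- names(mydata)[cols]   # label results by variable name

for(i in seq_along(cols)){
  result[i] <- sd(mydata[, cols[i]], na.rm = TRUE)
}

result
```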

Apply a function to several groups at once

The following commands can save time when you want to carry out an analysis by groups. The groups are indicated by a categorical variable (month, continent, etc) that is a factor or a character variable. In the examples below, "x" is a numeric variable and "A" is a categorical variable (a factor or character variable) identifying multiple groups.

Use tapply for one group variable

"tapply" is one of the most useful commands you'll ever need. It allows you to apply a function to a single variable "x" separately for each of two or more groups identified by a categorical variable (call it "A"). For example, to calculate the median of "x" separately for each group identified in the variable "A",

tapply(x, INDEX = A, FUN = median, na.rm=TRUE)

In this example, "na.rm" is not a "tapply" option -- rather it is an option for the function "median", telling it to remove missing values before computing. To pass optional arguments to FUN, as in this example, include them immediately after FUN.
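Here is the same command applied to a small made-up data set, so that the result can be checked by hand.

```r
x <- c(5, 7, 9, 2, 4, NA)               # hypothetical measurements
A <- c("a", "a", "a", "b", "b", "b")    # hypothetical group labels

res <- tapply(x, INDEX = A, FUN = median, na.rm = TRUE)

res   # median is 7 for group "a" and 3 for group "b"
```

The result is labelled by group name, so a single value can be extracted with, for example, res["a"].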

Use aggregate for more than one group variable

You'll want to use "aggregate" instead of "tapply" if your groups are identified by more than one categorical variable (e.g., MONTH and YEAR). The working of the command is slightly different from that of "tapply". For example, to calculate the median of "x" separately for each group identified by the two categorical variables "A" and "B",

aggregate(x, by = list(A,B), FUN = median, na.rm = TRUE)
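The sketch below applies the command to made-up data. Naming the elements of the list, as in "list(A = A, B = B)", is optional, but it gives the grouping columns of the output informative names.

```r
x <- c(1, 3, 5, 7, 9, 11)                 # hypothetical measurements
A <- c("a", "a", "b", "b", "a", "b")      # first grouping variable
B <- c("u", "u", "u", "v", "v", "v")      # second grouping variable

res <- aggregate(x, by = list(A = A, B = B), FUN = median, na.rm = TRUE)

res   # a data frame with columns A, B and x, one row per combination
```

Unlike "tapply", the output of "aggregate" is a data frame, with the results stored in the column "x".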

Use aggregate to apply function to several variables

"aggregate" also allows you to apply a function to multiple variables separately for each of two or more groups. For example, to calculate the median for the variables in the 2nd through 5th column of the data frame "mydata", separately for each group identified in the variable "A",

aggregate(mydata[,2:5], by = list(A), FUN = median, na.rm=TRUE)

Use a loop

Another workable strategy uses a loop to apply a function to multiple groups. The following example uses a loop to carry out a regression of y on x separately for each group identified in the variable "A". x and y are both numeric variables,

Agroups <- unique(na.omit(A)) # Unique group names of A

for(i in Agroups){

xi <- x[A == i] # select the x data for group i

yi <- y[A == i] # select the y data for group i

z <- lm( yi ~ xi ) # the regression analysis for group i

print(summary(z))
}
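As with the earlier loop, you may prefer to save a result from each group rather than print the full summary. The sketch below stores the regression slope for each group in a named vector. The data and the name "slopes" are made up for illustration.

```r
# Hypothetical data: two groups of four points each
x <- c(1, 2, 3, 4, 1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8, 1.0, 0.9, 1.2, 0.8)
A <- c(rep("a", 4), rep("b", 4))

Agroups <- unique(na.omit(A))         # unique group names of A

slopes <- numeric(length(Agroups))    # empty vector to hold the slopes

names(slopes) <- Agroups              # label each slope by its group

for(i in Agroups){
  z <- lm( y[A == i] ~ x[A == i] )    # regression for group i
  slopes[i] <- coef(z)[2]             # second coefficient is the slope
}

slopes
```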
