This page introduces the basics of working with data sets, which often contain multiple variables of different types.
Data frames are the most convenient data objects in R.
Tibbles are a type of data frame used by tidyverse packages and can be converted to data frames as follows.
# convert tibble type to data frame
mydata <- as.data.frame(mydata)
# do the reverse (as_tibble() requires the tibble or dplyr package)
mydata <- as_tibble(mydata)
You will also run into matrices, arrays, and lists, which are described briefly at the end. A section is added on manipulating image data.
You won’t need anything but base R to read and write data to files. The following packages are for more specialized uses, shown below. You might need to install them first.
library(readr) # includes tidyverse read_csv() command to read .csv files
library(readxl) # read data from excel files
library(reshape2) # reshape data from wide to long format
library(googlesheets4) # read data from online google spreadsheet
library(data.table) # for really fast read and write
library(vroom) # for really fast read
library(imager) # for reading, manipulating, writing images
Enter your data using a spreadsheet program and save it in text format to a comma-delimited file (e.g., “myfile.csv”). A text file is never obsolete but data saved in a proprietary format may not be readable 30 years from now.
Read your data into R using the read.csv() command (more details are below).
mydata <- read.csv("myfile.csv")
Avoid spaces in variable names; use a period or underscore instead (e.g., size.mm or size_mm).
.csv
file.
A “long” layout is recommended, instead of a “wide” layout, when using linear models to analyze data. Use different columns for variables and different rows for sample units.
Here’s an example of a “wide” layout with data on the running speed of individual marked lizards in separate years. The same response variable, running speed, is included in two columns, one for each year. (Data are from Huey, R. B. and A. E. Dunham. 1987. Evolution 42: 1116-1120.)
lizard1 <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/LizardSpeed.csv")
head(lizard1)
##   lizardID speed1984 speed1985
## 1        1      1.43      1.37
## 2        2      1.56      1.30
## 3        3      1.64      1.36
## 4        4      2.13      1.54
## 5        5      1.96      1.82
## 6        6      1.89      1.79
The wide layout can be converted to a long layout using the melt() command in the reshape2 package.
library(reshape2) # load the package
lizard2 <- melt(lizard1, id.vars = c("lizardID"), variable.name = "Year",
value.name = "Speed", factorsAsStrings = FALSE)
head(lizard2)
##   lizardID      Year Speed
## 1        1 speed1984  1.43
## 2        2 speed1984  1.56
## 3        3 speed1984  1.64
## 4        4 speed1984  2.13
## 5        5 speed1984  1.96
## 6        6 speed1984  1.89
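If you would rather not install reshape2, base R's reshape() command can do the same kind of conversion. The following is only a sketch: it uses the first three lizards from the printout above, typed in by hand rather than read from the file, and the object names lizard_small and lizard_long are made up for the illustration.

```r
# First three lizards from the printout above, entered by hand (illustration only)
lizard_small <- data.frame(lizardID  = 1:3,
                           speed1984 = c(1.43, 1.56, 1.64),
                           speed1985 = c(1.37, 1.30, 1.36))

# Stack the two speed columns into a single Speed column;
# Year records which of the two columns each value came from
lizard_long <- reshape(lizard_small, direction = "long",
                       varying = c("speed1984", "speed1985"),
                       v.names = "Speed", timevar = "Year",
                       times = c(1984, 1985), idvar = "lizardID")
rownames(lizard_long) <- NULL   # drop the row names that reshape() generates
lizard_long
```

Unlike the melt() result, Year here holds the numbers 1984 and 1985 rather than the original column names.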
The following command in base R reads a data file named “myfile.csv” into a data frame mydata.
mydata <- read.csv("myfile.csv")
mydata <- read.csv(file.choose()) # navigate to file location
R treats “word” and “ word” (with a leading space) differently, which is not usually desired. To remove leading and trailing spaces,
# remove leading and trailing spaces, treat blank cells as missing too
mydata <- read.csv("myfile.csv", strip.white = TRUE,
na.strings = c("NA",""))
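To see what these two arguments do without needing a file on disk, here is a small sketch that reads made-up data from a text string using the text argument of read.csv(). The data and the object names csv_text, raw, and clean are invented for the illustration.

```r
# A tiny comma-delimited "file" held in a text string (made-up data)
csv_text <- "site,sex\n word ,f\nword,\n"

# Default settings keep the surrounding spaces and the blank cell
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
unique(raw$site)      # " word " and "word" are treated as two different values

# strip.white trims the spaces; na.strings turns the blank cell into NA
clean <- read.csv(text = csv_text, strip.white = TRUE,
                  na.strings = c("NA", ""), stringsAsFactors = FALSE)
unique(clean$site)    # a single value, "word"
is.na(clean$sex)      # FALSE TRUE
```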
Alternatively, you can use the read_csv() command from the readr package, which automatically takes care of leading and trailing spaces and knows what to do with blank cells.
library(readr)
mydata <- read_csv("myfile.csv")
If you have allowed anyone with the link to read a Google sheet of your own making, then you can read it into R. To demonstrate, I have created a small Google sheet to test. You’ll need the googlesheets4 package.
library(googlesheets4)
# Suspend authorization
gs4_deauth()
# My test sheet
testSheet <- gs4_get("https://docs.google.com/spreadsheets/d/1yRSa4WMnVUc8Q46Td__gOcFar1aigdfxId_YaZZ27yA/edit?usp=sharing")
# Read the data
mydata <- as.data.frame(read_sheet(testSheet, sheet = 1))
It is possible to read data directly from an Excel file. You’ll need the readxl package.
library(readxl)
# Read the data
mydata <- as.data.frame(read_excel("myExcelFile.xlsx", sheet = 1))
To write the data frame mydata to a comma-delimited text file,
write.csv(mydata, file = "myfile.csv")
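By default write.csv() also writes the row names as an unnamed first column, which then shows up as an extra variable when the file is read back in. A minimal sketch of a round trip that avoids this, using a temporary file and made-up data (the names tmp and back are just for the illustration):

```r
# Small made-up data frame
mydata <- data.frame(site = c("a", "b"), size.mm = c(10.5, 12.1))

# Write to a temporary file, leaving out the row names
tmp <- tempfile(fileext = ".csv")
write.csv(mydata, file = tmp, row.names = FALSE)

# Read it back and confirm the columns are the same as before
back <- read.csv(tmp)
names(back)
```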
For big data sets that seem to take forever to read and write, try these commands from the data.table package.
Use fwrite() to write to a .csv file really quickly.
library(data.table)
fwrite(mydata, file = "mydata.csv", sep = ",", col.names = TRUE, row.names = TRUE)
Use fread() to read a big .csv file. If your file has row names, these will be placed into the first variable column and must be restored with additional commands.
mydata <- data.frame(fread(file = "myfile.csv"))
# If you want to recover row names in first column:
rownames(mydata) <- mydata[, 1]
mydata <- mydata[,-1]
Another method to read really large .csv files is to use vroom().
library(vroom)
mydata <- vroom("myfile.csv")
read.csv() will automatically classify your variables into types such as integer, numeric, and character unless you specify otherwise using additional arguments.
To check on how R has classified all your variables, enter
str(mydata) # base R
glimpse(mydata) # dplyr command
To check on R’s classification of just one variable, x,
class(mydata$x) # integer, character, factor, numeric, etc
is.character(mydata$x) # result: TRUE or FALSE
is.integer(mydata$x) # result: TRUE or FALSE
is.factor(mydata$x) # result: TRUE or FALSE
Include the read.csv() argument stringsAsFactors = TRUE if you want character data columns turned into factor variables instead of character text.
A factor is a categorical variable whose categories represent levels. These levels have names, but they additionally have a numeric interpretation. If a variable A has 3 categories “a”, “b”, and “c”, R will order the levels alphabetically, by default, and give them the corresponding numerical interpretations 1, 2, and 3. This will determine the order that the categories appear in graphs and tables.
You can always change the order of the levels for a factor variable. For example, if you want “c” to be first (e.g., because it refers to the control group), set the order as follows:
A <- factor(A, levels = c("c", "a", "b"))
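The numerical interpretation of the levels can be seen with as.integer(). Here is a small sketch with a made-up variable A; the name A2 is introduced only so the two orderings can be compared.

```r
# A made-up categorical variable
A <- factor(c("b", "a", "c", "a"))
levels(A)        # "a" "b" "c"  (alphabetical by default)
as.integer(A)    # 2 1 3 1

# Put "c" first, e.g., because it is the control group
A2 <- factor(A, levels = c("c", "a", "b"))
levels(A2)       # "c" "a" "b"
as.integer(A2)   # 3 2 1 2
```

The data values themselves are unchanged; only the order of the levels, and so the integer attached to each, is different.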
You can convert variables to a different type.
mydata$A <- as.factor(mydata$A) # character to factor
mydata$A <- as.character(mydata$A) # factor to character
mydata$A <- as.numeric(as.character(mydata$A)) # factor to numeric
Always check the results of a conversion to make sure R did what you wanted. Check especially how missing values were converted.
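The factor-to-numeric conversion is a good example of why checking matters. Applying as.numeric() directly to a factor returns the level numbers, not the values, which is why as.character() is used first. A sketch with made-up data (the entry "n/a" stands in for a typo or a non-numeric code in the data file):

```r
# Numbers that were read in as a factor because of one non-numeric entry
f <- factor(c("10", "2.5", "n/a", "10"))

# Direct conversion gives the level numbers, which is rarely what you want
codes <- as.numeric(f)
codes                          # 1 2 3 1

# Converting via character recovers the values; "n/a" becomes NA (with a warning)
values <- suppressWarnings(as.numeric(as.character(f)))
values                         # 10.0 2.5 NA 10.0
```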
The following commands are useful for viewing aspects of a data frame.
mydata # print whole data frame
print(mydata, n=5) # print the first 5 rows (tibbles only; for a data frame use head(mydata, 5))
head(mydata) # print the first few rows
tail(mydata) # print the last few rows
names(mydata) # see the variable names
rownames(mydata) # view row names (numbers, if you haven't assigned names)
These functions are applied to the whole data frame.
str(mydata) # summary of variables in frame
is.data.frame(mydata) # TRUE or FALSE
ncol(mydata) # number of columns in data
nrow(mydata) # number of rows
names(mydata) # variable names
names(mydata)[1] <- c("quad") # rename 1st variable to "quad"
rownames(mydata) # row names
The columns of the data frame are vectors representing variables. They can be accessed in several ways.
mydata$site # the variable named "site" in mydata
mydata[ , 2] # the second variable (column) of the data frame
mydata[5, 2] # the 5th element (row) of the second variable
select(mydata, site) # dplyr package: returns a one-column data frame; use pull(mydata, site) for the vector
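Whether you get back a vector or a data frame depends on how you ask. A short base R sketch with a made-up data frame shows the difference; the drop = FALSE argument is what keeps a single column as a data frame.

```r
# Made-up data frame with two variables
mydata <- data.frame(quad = 1:3, site = c("A", "B", "C"),
                     stringsAsFactors = FALSE)

v1 <- mydata$site                 # a vector
v2 <- mydata[["site"]]            # the same vector, using the name as a string
v3 <- mydata[, 2]                 # the same vector, by column position
d1 <- mydata[, 2, drop = FALSE]   # a data frame with one column

is.data.frame(v3)   # FALSE
is.data.frame(d1)   # TRUE
```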
For example, here is how to log transform a variable named size.mm and save the result as a new variable named logsize in the data frame mydata. (log yields the natural log, whereas the function log10 yields log base 10.)
mydata$logsize <- log(mydata$size.mm) # natural log
mydata <- mutate(mydata, logsize = log(size.mm)) # equivalent command in dplyr package
For example, to delete the variable site from mydata, use
mydata$site <- NULL # NULL must be upper case
mydata <- select(mydata, -site) # dplyr method
There are several ways to extract a subset of a data frame.
One is to use indicators inside square brackets using the following format: mydata[rows, columns].
newdata <- mydata[ , c(2,3)] # all rows, columns 2 and 3 only;
newdata <- mydata[ , -1] # all rows, leave out first column
newdata <- mydata[1:3, 1:2] # first three rows, first two columns
Logical statements and variable names within the square brackets also work.
The following commands extract three variables of females whose size is less than 25 mm. Note the double “==” sign to represent “equals” in the logical statement.
newdata <- mydata[mydata$sex == "f" & mydata$size.mm < 25,
c("site","id","weight")]
newdata <- subset(mydata, sex == "f" & size.mm < 25,
select = c(site,id,weight))
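The two commands above give the same result only when the variables used in the logical statement have no missing values. With square brackets, a row whose condition evaluates to NA comes back as a row of NAs, whereas subset() drops it. A sketch with made-up data; wrapping the condition in which() is one way to make the bracket version drop those rows too.

```r
# Made-up data; the 4th individual has a missing value for sex
mydata <- data.frame(site    = c("A", "A", "B", "B"),
                     id      = 1:4,
                     sex     = c("f", "m", "f", NA),
                     size.mm = c(20, 22, 30, 18),
                     weight  = c(5.1, 6.0, 9.8, 4.4))

# Square brackets: the NA condition yields an extra row filled with NA
new1 <- mydata[mydata$sex == "f" & mydata$size.mm < 25,
               c("site", "id", "weight")]

# subset() silently drops rows where the condition is NA
new2 <- subset(mydata, sex == "f" & size.mm < 25,
               select = c(site, id, weight))

# which() keeps only the rows where the condition is TRUE
new3 <- mydata[which(mydata$sex == "f" & mydata$size.mm < 25),
               c("site", "id", "weight")]

nrow(new1)   # 2 (one real row plus one row of NAs)
nrow(new2)   # 1
nrow(new3)   # 1
```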
You can also use dplyr’s filter and select commands. Use select to extract variables (columns), and use filter to select rows, as in the following examples.
# extract rows
temp <- filter(mydata, sex == "f")
# extract columns
newdata <- select(temp, site, id, weight)
To re-order the rows of a data frame mydata to correspond to the sorted order of one of its variables, say x, use
mydata.x <- mydata[order(mydata$x), ] # base R
mydata.x <- arrange(mydata, x) # dplyr method
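A small sketch of what order() does, using made-up data. It returns the row positions in sorted order, places missing values last by default, and accepts decreasing = TRUE to reverse the sort.

```r
# Made-up data with one missing value
mydata <- data.frame(id = 1:4, x = c(3.2, 1.5, NA, 2.1))

order(mydata$x)                            # 2 4 1 3  (row positions, NA last)
up   <- mydata[order(mydata$x), ]          # smallest x first
down <- mydata[order(mydata$x, decreasing = TRUE), ]   # largest x first

up$id     # 2 4 1 3
down$id   # 1 4 2 3
```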
Measurements stored in two data frames might relate to one another. For example, one data frame might contain measurements of individuals of a bird species (e.g., weight, age, sex) caught at multiple sites. A second data frame might contain physical measurements made at those sites (e.g., elevation, rainfall). If the site names in both data frames correspond, then it is possible to bring one or all the variables from the second data frame to the first.
For example, to bring the site variable “elevation” from the sites data frame to the birds data frame,
birds$elevation <- sites$elevation[match(birds$siteno, sites$siteno)]
To bring all the variables from the sites data set to the bird data set, corresponding to the same sites in both data frames, use the dplyr command
birds2 <- left_join(birds, sites, by="siteno")
Always check the results to make sure R did what you wanted.
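One thing worth checking is what happened to individuals from a site that is missing from the second data frame. The sketch below uses made-up birds and sites data frames and base R only; merge() with all.x = TRUE is used here as a base R stand-in for the dplyr join, not as the same command.

```r
# Made-up data: bird 4 was caught at site 3, which has no site measurements
birds <- data.frame(id = 1:4, siteno = c(2, 1, 2, 3))
sites <- data.frame(siteno = c(1, 2), elevation = c(150, 900))

# match() gives, for each bird, the row of sites with the same siteno (NA if none)
match(birds$siteno, sites$siteno)          # 2 1 2 NA
birds$elevation <- sites$elevation[match(birds$siteno, sites$siteno)]
birds$elevation                            # 900 150 900 NA

# Base R merge keeping every bird, including those with no matching site
birds2 <- merge(birds[, c("id", "siteno")], sites, by = "siteno", all.x = TRUE)
sum(is.na(birds2$elevation))               # 1 bird has no elevation
```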
Some functions will give a matrix as output, which is not as convenient as a data frame. For example, all columns of a matrix must be of the same data type. Briefly, here’s how to manipulate matrices and convert them to data frames.
mydata <- as.data.frame(xmat)
dim(xmat) # dimensions (rows & columns) of a matrix
ncol(xmat) # number of columns in matrix
nrow(xmat) # number of rows
t(xmat) # transpose a matrix
Use integers in square brackets to access subsets of a matrix. Within square brackets, integers before the comma refer to rows, whereas integers after the comma indicate columns: [rows, columns].
xmat[2,3] # value in the 2nd row, 3rd column of matrix
xmat[, 2] # 2nd column of matrix (result is a vector)
xmat[2, ] # 2nd row of matrix (result is a vector)
xmat[ ,c(2,3)] # matrix subset containing columns 2 and 3 only
xmat[-1, ] # matrix subset leaving out first row
xmat[1:3,1:2] # submatrix containing first 3 rows and first 2 columns only
Use matrix to reshape a vector into a matrix. For example,
x <- c(1,2,3,4,5,6)
xmat <- matrix(x, nrow = 2)
yields the matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
and
xmat <- matrix(x, nrow = 2, byrow = TRUE)
yields the matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
Use cbind to bind vectors of equal length as columns, and use rbind to bind them as rows instead. For example,
x <- c(1, 2, 3)
y <- c(4, 5, 6)
xmat <- cbind(x, y)
yields the matrix
     x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
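For comparison, here is a sketch of rbind applied to the same two vectors. The vector names become row names instead of column names, and the result is the transpose of the cbind matrix. The object name xmat2 is made up for the illustration.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

xmat2 <- rbind(x, y)   # each vector becomes a row
dim(xmat2)             # 2 3
rownames(xmat2)        # "x" "y"
xmat2["y", 2]          # 5: row named "y", 2nd column
```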
Some R functions will output results in a list. A list is a collection of R objects bundled together in a single object. The component objects can be anything at all: vectors, matrices, data frames, and even other lists. The different objects needn’t have the same length or number of rows and columns.
Use the list command to create a list of multiple objects. For example, here two vectors are bundled into a list
x <- c(1,2,3,4,5,6,7)
y <- c("a","b","c","d","e")
mylist <- list(x,y) # simple version
mylist <- list(name1 = x, name2 = y) # give a name to each list object
Entering mylist in the R command window shows the contents of the list, which is
[[1]]
[1] 1 2 3 4 5 6 7
[[2]]
[1] "a" "b" "c" "d" "e"
if the components were left unnamed, or
$name1
[1] 1 2 3 4 5 6 7
$name2
[1] "a" "b" "c" "d" "e"
if you named the list components.
Use the “$” symbol to name a new object in the list.
z <- c("A","C","G","T")
mylist$name3 <- z
Use the “$” to grab a named object in a list. Or, use an integer between double square brackets,
mylist$name2 # the 2nd list object
mylist[[2]] # the 2nd list component, here a vector
mylist[[1]][4] # the 4th element of the 1st list component, here 4
names(mylist) # NULL if components are unnamed
unlist(mylist) # collapse list to a single vector
Converting a list to a data frame is advised only if all list objects are vectors of equal length.
# construct a list
x <- c(1,2,3,4,5,6,7)
y <- c("a","b","c","d","e","f","g")
mylist <- list(x = x, y = y)
# convert it to a data frame
mydata <- do.call("cbind.data.frame", mylist)
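Before converting, you can confirm that the components really are the same length using lengths(). The sketch below also uses as.data.frame() as an alternative to the do.call() approach; for a list of equal-length vectors like this one I would expect the two to give the same data frame, but check the result in your own case.

```r
# construct a list of two vectors of equal length
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c("a", "b", "c", "d", "e", "f", "g")
mylist <- list(x = x, y = y)

# check that every component has the same length
lengths(mylist)                          # x: 7, y: 7
same_length <- length(unique(lengths(mylist))) == 1

# convert, keeping y as character text
mydata <- as.data.frame(mylist, stringsAsFactors = FALSE)
dim(mydata)                              # 7 2
```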
Images are stored as arrays of numbers in R. A matrix is a type of array, but arrays can have more than two dimensions. The imager package is useful for reading, manipulating, and writing images.
library(imager)
img <- load.image("myimage.jpg")
I’ll use the following example image to demonstrate the commands, stored on the web. The imported image is stored as a cimg object, which is a type of array.
pip <- load.image("https://www.zoology.ubc.ca/~schluter/R/images/pip.jpg")
The dimensions of the image array can be checked with the dim() command. The first two dimensions of the image are the width (1536 pixels) and height (2040 pixels) of the 2D image. The third dimension is the depth of the image, which is 1 for a 2D image (a cimg object can also store 3D images). The 4th dimension of the array indicates the color channels (3 for RGB color).
dim(pip)
## [1] 1536 2040 1 3
For a 2D image, such as our example, each color channel is a matrix of 1536 x 2040 numbers. Each element of the matrix corresponds to a pixel and contains a number representing the intensity of the color. The matrix is too big to display in this document, but you can see the contents of the 5 x 5 pixels in the top left corner of the matrix of the “G” channel as follows.
pip[1:5, 1:5, 1, 2]
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.2705882 0.2784314 0.2862745 0.2901961 0.3058824
## [2,] 0.2823529 0.2941176 0.2980392 0.3019608 0.3176471
## [3,] 0.2941176 0.3019608 0.3058824 0.3137255 0.3254902
## [4,] 0.2901961 0.2941176 0.3019608 0.3098039 0.3254902
## [5,] 0.2745098 0.2784314 0.2862745 0.3019608 0.3254902
Each element of the matrix in a given color channel ranges from 0 (no color) to 1 (full color).
The plot() command will display the image.
plot(pip, axes = FALSE)
Cropping an image involves removing pixels from the edges. The crop.borders() command accomplishes this.
For example, to crop the example image so that it is square rather than rectangular, I need to remove 2040 - 1536 = 504 pixels from the height, or 504/2 = 252 pixels from the top and bottom.
pip_square <- crop.borders(pip, nx = 0, ny = 252)
dim(pip_square)
## [1] 1536 1536 1 3
plot(pip_square, axes = FALSE)
Resizing an image involves changing the number of pixels in the image. Here I downsample the cropped example image to 100 x 100 pixels.
pip_100 <- resize(pip_square, size_x = 100, size_y = 100)
dim(pip_100)
## [1] 100 100 1 3
plot(pip_100, axes = FALSE)
The grayscale() command converts an RGB image to black and white. The resulting image now has only one channel indicating grayscale, with 0 indicating black, 1 indicating white, and numbers in between indicating shades of gray.
pip_bw <- grayscale(pip_100)
dim(pip_bw)
## [1] 100 100 1 1
plot(pip_bw, axes = FALSE)
For some applications it is necessary to transpose the image. To accomplish this, we transpose the matrix of each color channel. In the case of a grayscale image the only color channel (4th dimension of the array) is 1.
pip_transposed <- pip_bw # initialize
pip_transposed[,,1,1] <- t(pip_transposed[,,1,1])
plot(pip_transposed, axes = FALSE)
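The assignment above works because the example image is square, so the transposed matrix fits back into the same slot. If the width and height differ, one option is base R's aperm() command, which swaps dimensions of an array. The following is only a sketch on a small plain array standing in for an image; I have not shown how the result behaves as a cimg object, so treat that step as an assumption to verify.

```r
# A plain 2 x 3 array with depth 1 and a single channel (stand-in for a small image)
img <- array(1:6, dim = c(2, 3, 1, 1))

# Swap the first two dimensions (width and height); leave depth and channel alone
img_t <- aperm(img, c(2, 1, 3, 4))

dim(img)     # 2 3 1 1
dim(img_t)   # 3 2 1 1

# The single channel of the result is the transpose of the original channel
identical(img_t[, , 1, 1], t(img[, , 1, 1]))   # TRUE
```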
The save.image() command saves the image to a file. The file type is determined by the file extension.
save.image(pip_bw, "pip_bw.jpg")
© 2009-2024 Dolph Schluter