Tips and recommendations for making graphs and tables in R.
Base R has a lot of built-in plot functions that are easy to use. For
example, to examine an association between variables x
and
y
you can often just enter
plot(y ~ x)
to get an unfussy plot. But to make the plots beautiful a lot of options need to be added. Most plotting functions will accept the same arguments to control plot elements. For the full list of plotting options in base R, get help as follows.
?par
The tidyverse is a collection of contributed packages for R that
share a common philosophy about data and syntax and work well together.
Two of the most-used packages are ggplot2
for graphics and
dplyr
for data manipulation.
Building a graph using the ggplot()
command involves
adding components or “layers” including data, “aesthetics” that map
variables to visuals, and “geoms” that create different kinds of plots.
For example, a basic scatter plot of yvar
against
xvar
has the following components.
ggplot(data = mydata, mapping = aes(x = xvar, y = yvar)) +
geom_point()
The default ggplot()
style has too much chartjunk for
this writer. Options to alter the theme and other graph components are
shown in the tips for specific graphs below.
Hadley Wickham’s book (Wickam, H. 2016. ggplot2: Elegant graphics
for data analysis. 2nd edition) is the standard
ggplot2
reference, but plenty of introductory resources are
available online (e.g.,
this
one)
library(ggplot2)
library(dplyr)
library(tidyr)
# Single-use packages
library(ggmosaic) # mosaic plot, works with ggplot2
library(vioplot) # violin plot for base R
Examples on this page make use of the following data sets, which you can read using the commands below.
bbird <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/BlackbirdTestosterone.csv")
BCfires <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/BCfires.csv")
BCfires1 <- subset(BCfires, SIZE_HA >= 1.0)
BCfireFrequency1 <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/BCfireFrequency1.csv", row.names=1)
birthDay <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/birthDay2016.csv")
mink <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/minkBrains.csv")
stickle <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/schluterMcphail1992.csv")
titanic <- read.csv("https://www.zoology.ubc.ca/~schluter/R/csv/Titanic.csv")
bbird
has data on immunocompetence of red-winged
blackbirds before and after testosterone implants. From Hasselquist et
al. (1999, Behavioral Ecology and Sociobiology).
BCfires
contains data on all forest fires in the
province of British Columbia, Canada, from 1950 to 2021 (downloaded from
the Canadian National Fire DataBase and cleaned up a little).
BCfires1
is the subset of all fires at least 1 ha in size,
and BCfireFrequency1
is the number of fires greater than 1
ha in size per month and year. The data set is not perfect – small fires
weren’t reported in the early days.
birthDay
contains data on the week day of birth of a
random sample of 350 babies born in the USA in 1999 (example from
Whitlock & Schluter (2020, The Analysis of Biological
Data).
mink
is a data set on brain sizes of wild, farmed, and
feral mink by Pohle et al (2023, Royal Society Open Science). I adjusted
brain size for body size differences (condylobasal length,
cbl
) using common principal components (cpcbp
package).
stickle
contains population means of males and females
from threespine stickeleback populations. The data are from Schluter and
McPhail (1992, American Naturalist).
titanic
contains survival and gender data of passengers
on the disastrouas voyage of the Titanic.
In base R, you can send your plot to a pdf file instead of the screen as follows.
pdf(file = "mygraph.pdf") # opens the pdf device for plotting
plot(...) # Issue your R commands here to generate plot
dev.off() # closes the device when you are done
If you are using ggplot()
, follow your
ggplot()
command with the following line and it will save
it to the named pdf file.
ggsave("mygraph.pdf")
This frequency table counts the number (frequency) of cases in each
category of a categorical variable A
. The optinal
useNA
argument adds the category NA
in the
table if one or more cases is missing.
# base R
table(mydata$A, useNA = "ifany")
# dplyr method
summarize(group_by(mydata, A), Frequency = n())
For example, the following data on gender and survival of people on the Titanic.
head(titanic)
## gender survival
## 1 Male Survived
## 2 Male Survived
## 3 Male Survived
## 4 Male Survived
## 5 Male Survived
## 6 Male Survived
To generate a frequency table of the number of passengers who survived and the number that died, use either of the following two methods
# base R
table(titanic$survival, useNA = "ifany") # base R
##
## Died Survived
## 1438 654
# dplyr method
summarize(group_by(titanic, survival), Frequency = n()) # dplyr
## # A tibble: 2 × 2
## survival Frequency
## <chr> <int>
## 1 Died 1438
## 2 Survived 654
To generate a two-way frequency table for two categorical variables,
A
and B
, use the following command syntax.
table(mydata$A, mydata$B, useNA = "ifany")
For example, the Titanic data set contains survival information for both men and women.
table(titanic$survival, titanic$gender, useNA = "ifany")
##
## Female Male
## Died 109 1329
## Survived 316 338
To add the row and column sums to the table, use
addmargins()
. For example,
mytab <- table(titanic$survival, titanic$gender)
addmargins(mytab, margin = c(1,2), FUN = sum, quiet = TRUE)
##
## Female Male sum
## Died 109 1329 1438
## Survived 316 338 654
## sum 425 1667 2092
dplyr()
can make a contingency table but needs the
spread()
command from the tidyr
package.
For example, using the Titanic data set read above,
mytab <- summarize(group_by(titanic, survival, gender), n = n())
mytab <- spread(mytab, gender, n)
mytab
## # A tibble: 2 × 3
## # Groups: survival [2]
## survival Female Male
## <chr> <int> <int>
## 1 Died 109 1329
## 2 Survived 316 338
Zero table counts are given as NA. There’s an optional fix for that:
mutate(mytab, across(everything(), ~replace_na(., 0)))
The following commands generate a “flat” frequency table for two
categorical variables, A
and B
. In a flat
table, A
and B
are separate columns of a
table, and a third column tallies the frequencies of each combination.
Start with the table()
command but include the names of the
variables as an argument.
mytab <- table(mydata$A, mydata$B, dnn = c("A", "B"))
data.frame(mytab)
For example, using the Titanic data,
mytab <- table(titanic$survival, titanic$gender, dnn = c("survival", "gender"))
data.frame(mytab)
## survival gender Freq
## 1 Died Female 109
## 2 Survived Female 316
## 3 Died Male 1329
## 4 Survived Male 338
The count()
method of dplyr
can make a flat
table but will tally only the combinations of categories that have a
frequency greater than 0. Use the complete()
command from
the tidyr
package to add the 0’s (shown here, but not
actually needed in this example).
mytab <- count(titanic, survival, gender)
complete(mytab, survival, gender, fill = list(n = 0))
## # A tibble: 4 × 3
## survival gender n
## <chr> <chr> <int>
## 1 Died Female 109
## 2 Died Male 1329
## 3 Survived Female 316
## 4 Survived Male 338
R has efficient ways to get means, medians, sums, and other descriptive statistics by column or by row of data frames.
For an example, we’ll use data on the average number of fires (at least 1 ha in size) in BC each month. The mean number of fires in each month is averaged over the years 1950 to 2021.
head(BCfireFrequency1)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1950 0 0 4 3 36 87 72 34 103 3 0 1
## 1951 0 0 2 84 26 65 115 120 68 0 1 0
## 1952 0 0 4 17 94 16 60 107 69 89 1 0
## 1953 0 0 1 10 59 12 34 53 16 1 0 0
## 1954 0 0 8 16 62 13 30 10 2 1 0 1
## 1955 0 0 0 4 42 59 28 36 29 2 0 6
Use colMeans
to obtain the mean number of fires in each
month averaged over years. To view the resulting means, always arrange
them vertically - data.frame()
or cbind()
will
accomplish this. Here, I’ve also rounded the means to one decimal place
using round()
.
meanNoFires <- colMeans(BCfireFrequency1)
data.frame(meanNoFires = round(meanNoFires, 1))
## meanNoFires
## Jan 0.4
## Feb 0.6
## Mar 7.1
## Apr 49.9
## May 71.8
## Jun 49.4
## Jul 84.0
## Aug 78.9
## Sep 38.8
## Oct 20.5
## Nov 1.5
## Dec 0.3
Use rowSums
to obtain the the number of fires (at least
1 ha in size) in June, July, and August of each year. I use
head()
and tail()
to print the first 6 and the
last 6 years.
nFiresJune_to_Aug <- rowSums(BCfireFrequency1[, c("Jun","Jul","Aug")])
head(data.frame(nFiresJune_to_Aug))
## nFiresJune_to_Aug
## 1950 193
## 1951 300
## 1952 183
## 1953 99
## 1954 53
## 1955 123
tail(data.frame(nFiresJune_to_Aug))
## nFiresJune_to_Aug
## 2016 96
## 2017 249
## 2018 564
## 2019 62
## 2020 70
## 2021 367
colSums()
works the same way. Here is the total number
of files at least 1 ha in size in each month from 1950 to 2021.
nFiresByMonth1950_to_2021 <- colSums(BCfireFrequency1, na.rm = TRUE)
data.frame(nFiresByMonth1950_to_2021)
## nFiresByMonth1950_to_2021
## Jan 29
## Feb 43
## Mar 514
## Apr 3593
## May 5170
## Jun 3555
## Jul 6051
## Aug 5680
## Sep 2794
## Oct 1474
## Nov 106
## Dec 25
More generally, use apply()
with MARGIN = 1
to calculate other statistics by row of a data frame, and with
MARGIN = 2
to get other statistics by column.
For example, to get the median number of fires in BC (at least 1 ha in size) per year from 1950 to 2021 for every month (i.e., columns of the data frame), use the following.
medianNoFires <- apply(BCfireFrequency1, MARGIN = 2, median)
data.frame(medianNoFires)
## medianNoFires
## Jan 0.0
## Feb 0.0
## Mar 5.0
## Apr 44.5
## May 65.0
## Jun 34.0
## Jul 61.5
## Aug 59.5
## Sep 24.0
## Oct 12.0
## Nov 0.0
## Dec 0.0
Several methods calculate descriptive statistics for each level or category of a categorical variable (i.e., “by group”).
Here is how to generate a table of group means for a single numeric
variable y
, where A
is the categorical
variable whose levels are the groups.
Arguments associated with the function used (e.g., na.rm =
TRUE
to the function mean()
), are given right after
FUN
.
# result is a vector
tapply(mydata$y, INDEX = mydata$A, FUN = mean, na.rm = TRUE)
# result is a data frame
aggregate(y ~ A, data = mydata, FUN = mean, na.rm = TRUE)
# dplyr method; result is a data frame
summarize(group_by(mydata, A), ybar = mean(y, na.rm = TRUE))
I illustrate by calculating the total (summed) area burned each year
the forest fires data set, BCfires1
, which includes only
fires at least 1 ha in size.
# base R
totAreaBurned <- tapply(BCfires1$SIZE_HA, INDEX = BCfires1$YEAR, FUN = sum, na.rm = TRUE)
head(data.frame(totAreaBurned))
## totAreaBurned
## 1950 343038.4
## 1951 123688.2
## 1952 61372.3
## 1953 15397.5
## 1954 4126.2
## 1955 22748.4
totAreaBurned <- aggregate(SIZE_HA ~ YEAR, data = BCfires1, FUN = sum, na.rm = TRUE)
head(totAreaBurned)
## YEAR SIZE_HA
## 1 1950 343038.4
## 2 1951 123688.2
## 3 1952 61372.3
## 4 1953 15397.5
## 5 1954 4126.2
## 6 1955 22748.4
# dplyr method
totAreaBurned <- summarize(group_by(BCfires1, YEAR), totAreBurned = sum(SIZE_HA, na.rm = TRUE))
head(totAreaBurned)
## # A tibble: 6 × 2
## YEAR totAreBurned
## <int> <dbl>
## 1 1950 343038.
## 2 1951 123688.
## 3 1952 61372.
## 4 1953 15398.
## 5 1954 4126.
## 6 1955 22748.
With the same data set, I show two different methods to calculate both the median fire size and median fire latitude by year with a single command.
# base R
z <- aggregate(cbind(SIZE_HA, LATITUDE) ~ YEAR,
FUN = median, na.rm = TRUE, data = BCfires1)
head(z)
## YEAR SIZE_HA LATITUDE
## 1 1950 8.0 51.4340
## 2 1951 10.1 50.8470
## 3 1952 10.1 50.9630
## 4 1953 2.8 50.8035
## 5 1954 4.8 50.3760
## 6 1955 4.0 51.9340
# dplyr method
z <- summarize(group_by(BCfires1, YEAR),
medianFireSize = median(SIZE_HA, na.rm = TRUE),
medianLatitude = median(LATITUDE, na.rm = TRUE))
head(z)
## # A tibble: 6 × 3
## YEAR medianFireSize medianLatitude
## <int> <dbl> <dbl>
## 1 1950 8 51.4
## 2 1951 10.1 50.8
## 3 1952 10.1 51.0
## 4 1953 2.8 50.8
## 5 1954 4.8 50.4
## 6 1955 4 51.9
Descriptive statistics can be grouped by more than one attribute at once.
For example, here are three ways to calculate median fire size in BC, in ha, grouped by year and month.
# NA if no fires in that year and month
z <- tapply(BCfires1$SIZE_HA, INDEX = list(BCfires1$YEAR, BCfires1$MONTH),
FUN = median, na.rm = TRUE)
head(z)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1950 NA NA 5.00 2.00 4.0 14.1 5.0 6.2 10.1 1.2 NA 4.00
## 1951 NA NA 285.15 11.90 5.2 7.2 14.1 12.1 8.0 NA 32.3 NA
## 1952 NA NA 10.65 4.80 14.9 2.0 13.1 8.0 8.0 8.0 26.3 NA
## 1953 NA NA 1.30 11.70 9.7 2.2 2.6 2.4 1.8 2.4 NA NA
## 1954 NA NA 16.65 9.05 4.0 14.1 4.8 2.6 2.6 1.4 NA 12.10
## 1955 NA NA NA 11.05 6.0 6.0 2.6 4.3 2.8 5.0 NA 60.65
z <- aggregate(SIZE_HA ~ MONTH + YEAR, data = BCfires1,
FUN = median, na.rm = TRUE)
head(z)
## MONTH YEAR SIZE_HA
## 1 3 1950 5.0
## 2 4 1950 2.0
## 3 5 1950 4.0
## 4 6 1950 14.1
## 5 7 1950 5.0
## 6 8 1950 6.2
# dplyr method
BCfires1.grouped <- group_by(BCfires1, YEAR, MONTH)
z <- summarize(BCfires1.grouped,
totAreBurned = median(SIZE_HA, na.rm = TRUE))
head(z)
## # A tibble: 6 × 3
## # Groups: YEAR [1]
## YEAR MONTH totAreBurned
## <int> <int> <dbl>
## 1 1950 3 5
## 2 1950 4 2
## 3 1950 5 4
## 4 1950 6 14.1
## 5 1950 7 5
## 6 1950 8 6.2
The dplyr
summarize()
command can calculate
more than one descriptive statistic at once. Here we calculate both
total area burned and median fire size (ha) by year.
z <- summarize(group_by(BCfires1, YEAR),
totAreBurned = sum(SIZE_HA, na.rm = TRUE),
medianFireSize = median(SIZE_HA, na.rm = TRUE))
head(z)
## # A tibble: 6 × 3
## YEAR totAreBurned medianFireSize
## <int> <dbl> <dbl>
## 1 1950 343038. 8
## 2 1951 123688. 10.1
## 3 1952 61372. 10.1
## 4 1953 15398. 2.8
## 5 1954 4126. 4.8
## 6 1955 22748. 4
To show a frequency distribution for a single actegorical variable.
Here’s the basic method. A
is a categorical variable
identifying groups.
# base R
barplot(table(mydata$A), col = "firebrick",
space = 0.2, cex.names = 1.2)
# Using ggplot
ggplot(mydata, aes(x = A)) +
geom_bar(stat="count", fill = "firebrick")
For an example, I use the birthDay
data set, which has
the week day of birth of a random sample of 350 babies born in the USA
in 1999. The option las = 2
sets all the axis labels
perpendicular to the axes.
# Base R
barplot(table(birthDay$day), col = "firebrick", space = 0.2,
las = 2, ylab = "Frequency")
Using ggplot()
instead:
ggplot(birthDay, aes(x = day)) +
geom_bar(stat="count", fill = "firebrick") +
labs(x = "", y = "Frequency") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.3))
R arranges the categories of a character variable in alphabetical
order, which is rarely desired. Convert to a factor using the
factor()
command to specify a meaningful order for the
categories using thelevels
option. For example, put the
weekdays in the proper consecutive order, then redraw.
birthDay$day <- factor(birthDay$day, levels = c("Monday","Tuesday",
"Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
barplot(table(birthDay$day), col = "firebrick", space = 0.2,
las = 2, ylab = "Frequency")
Often the variable of interest might lack a natural ordering of categories and then the best option is to plot the bars in order of decreasing frequency.
# Using base R
barplot(sort(table(birthDay$day), decreasing = TRUE),
col = "firebrick", space = 0.2, las = 2, ylab = "Frequency")
# Using ggplot(), with additional text options
birthDay$day_ordered <- factor(birthDay$day,
levels = names(sort(table(birthDay$day), decreasing = TRUE)) )
ggplot(birthDay, aes(x = day_ordered)) +
geom_bar(stat="count", fill = "firebrick") +
labs(x = "Weekday", y = "Frequency") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.3),
text = element_text(size = 15), axis.text = element_text(size = 12))
If the frequencies are already tabulated in a data frame named
mytab
, then modify as follows. A
is a variable
in mytab
that lists each named category exactly once and
Freq
is a variable containing the corresponding frequency
of cases in each category.
mytab <- arrange(mytab, desc(Freq))
barplot(mytab$Freq, names.arg = mytab$A, col = "firebrick")
# or
mytab$A_ordered <- factor(mytab$A, levels = mytab$A[order(mytab$Freq, decreasing = TRUE)] )
ggplot(mytab, aes(x = A_ordered, y = Freq)) +
geom_bar(stat="identity", fill = "firebrick") +
labs(x = "A group", y = "Frequency") +
theme_classic() +
theme(text = element_text(size = 15),
axis.text = element_text(size = 12), aspect.ratio = 0.8)
This and the mosaic plot (next) compare the frequencies of a categorical variable by two (or more) groups.
The basic bar plot command is below A
and B
can be factors or character variables in a data frame
mydata
.
# Basic bar plot in base R
barplot(table(mydata$A, mydata$B), beside = TRUE)
I illustrate using the association between passenger survival and gender in the Titanic data.
# Order the survival categories so that "Survived" comes first
titanic$survival <- factor(titanic$survival, levels = c("Survived", "Died"))
barplot(table(titanic$survival, titanic$gender), beside = TRUE,
las = 1, col = c("goldenrod1", "firebrick"), cex.names = 0.8, space = c(0.2,0.8),
xlab = "Gender", ylab = "Frequency", legend.text = TRUE)
Here’s the grouped bar graph using ggplot()
. The
argument position_dodge2(preserve="single")
handles
category combinations having a count of 0 (not an issue with this
example).
ggplot(titanic, aes(x = gender, fill = survival)) +
geom_bar(stat = "count", position = position_dodge2(preserve="single")) +
scale_fill_manual(values=c( "goldenrod1", "firebrick")) +
labs(x = "Gender", y = "Frequency") +
theme_classic() +
theme(aspect.ratio = 0.80)
Here’s the basic command in base R.
mytable <- table(mydata$B, mydata$A)
mosaicplot(t(mytable), color = c("red","blue","yellow"),
las = 2, cex.axis = 0.8)
Construct the table with variables in the order shown above, so that
B
is the response (dependent) variable and A
is the explanatory variable. Then use the t()
function to
transpose the table in the mosaicplot()
function so that
both the table and the mosaic plot have the explanatory variable on the
x
axis and the response variables on the y
axis.
For example, the association between passenger survival and gender can be viewed in a mosaic plot using the following data.
# Order the categories so that "Survived" comes first so on top
titanic$survival <- factor(titanic$survival, levels = c("Survived", "Died"))
titanicTable <- table(titanic$survival, titanic$gender)
titanicTable
##
## Female Male
## Survived 316 338
## Died 109 1329
par(pty = "s") # makes a square plot
mosaicplot( t(titanicTable), col = c("goldenrod1", "firebrick"), cex.axis = 0.8,
sub = "Gender", ylab = "Relative frequency", main = "")
To draw a mosaic plot using ggplot
, you’ll first need to
install and load the ggmosaic
package.
# Order the categories so that "Died" comes first so on bottom
titanic$survival <- factor(titanic$survival, levels = c("Died", "Survived"))
ggplot(titanic) +
geom_mosaic(aes(x = product(survival, gender), fill=survival)) +
labs(x = "gender") +
theme_classic() +
scale_fill_manual(values=c("firebrick", "goldenrod1")) +
theme_classic() +
theme(aspect.ratio = 1) +
guides(fill = guide_legend(title = "Survival", reverse = TRUE))
Show the freqency distribution of a numeric variable.
The basic command is
hist(mydata$x, right = FALSE)
The option right = FALSE
causes all the histogram bins,
except the last one to be left-closed. In other words, the
value 1 would appear in the interval 1-2 rather than in the interval
0-1, which is the convention.
For example, here is a histogram of the frequency of forest fires in
British Columbia, Canada, from 1950 to 2021, by month. Even using
right = FALSE
R has put December fires into the 11-12 month
bin. To control this, use breaks = seq(0, 13, by = 1)
instead, as in the subsequent plot below.
hist(BCfires1$MONTH, right = FALSE, col = "firebrick", breaks = 12,
xlab = "Month", ylab = "Forest fires 1950 - 2021", main = "", las = 1)
Or use ggplot()
ggplot(BCfires1, aes(x = MONTH)) +
geom_histogram(fill = "firebrick", col = "black",
breaks = seq(0, 13, 1), closed = "left") +
labs(x = "Month", y = "Forest fires 1950 - 2021") +
theme_classic() +
theme(aspect.ratio = 0.80)
To display probability density in base R instead of raw frequencies,
add prob = TRUE
to your arguments. I show this in the next
section, where I also add a normal curve to the plot
# Not run
hist(BCfires1$MONTH, right = FALSE, col = "firebrick", breaks = 12,
xlab = "Month", ylab = "Forest fires 1950 - 2021", main = "",
las = 1, prob = TRUE)
If using ggplot()
,
# Not run
ggplot(BCfires1, aes(x = MONTH)) +
geom_histogram(aes(y = after_stat(density)), fill = "firebrick",
col = "black", breaks = seq(0, 13, 1), closed = "left") +
labs(x = "Month", y = "Probability density") +
theme_classic() +
theme(aspect.ratio = 0.80)
To superimpose a normal density curve on a probability density
histogram, try the following commands. First, 101 evenly spaced points
along the x
axis, are made between the smallest and largest
data value using seq
. Then dnorm()
generates
the normal density at each x
point, using the mean and
standard deviation of the data.
hist(BCfires1$MONTH, right = FALSE, col = "firebrick", breaks = 12,
xlab = "Month", ylab = "Forest fires 1950 - 2021", main = "",
las = 1, prob = TRUE, ylim = c(0, 0.25))
m <- mean(BCfires1$MONTH, na.rm = TRUE)
s <- sd(BCfires1$MONTH, na.rm = TRUE)
xpts <- seq(from = min(BCfires1$MONTH, na.rm=TRUE),
to = max(BCfires1$MONTH, na.rm = TRUE),
length.out = 101)
lines(dnorm(xpts, mean=m, sd=s) ~ xpts, col="red", lwd=2)
With ggplot()
, add a stat_function()
to get
the same,
# Not run
ggplot(BCfires1, aes(x = MONTH)) +
geom_histogram(aes(y = after_stat(density)), fill = "firebrick",
col = "black", breaks = seq(0, 13, 1), closed = "left") +
stat_function(fun = dnorm,
args = list(mean = mean(BCfires1$MONTH, na.rm = TRUE),
sd = sd(BCfires1$MONTH, na.rm = TRUE))) +
labs(x = "Month", y = "Probability density") +
theme_classic() +
theme(aspect.ratio = 0.80)
Use facet_wrap()
with ggplot()
to create a
vertical stack of histograms, allowing visual comparison.
The basic command is
ggplot(mydata, aes(x = x)) +
geom_histogram(closed = "left") +
facet_wrap( ~ A, ncol = 1, scales = "free_y")
The example below used the mink
data set on brain sizes
of wild, farmed, and feral mink. The trend shows that farmed mink have
small brains and feral mink regain the larger brains of their wild
ancestors.
head(mink)
## id_number category date sex age_months cbl vol vol3 vol3.cpc
## 1 1 farm 2009-11-10 F 6 67.18 41.32 3.457165 3.427563
## 2 2 farm 2009-11-10 F 6 64.22 38.97 3.390342 3.441955
## 3 3 farm 2009-11-10 F 6 64.88 41.23 3.454653 3.476658
## 4 4 farm 2009-11-10 F 18 64.98 40.36 3.430181 3.455552
## 5 5 farm 2009-11-10 F 6 65.31 37.86 3.357842 3.392389
## 6 6 farm 2009-11-10 F 6 66.28 36.09 3.304677 3.330347
# brain volume of males by category of mink
ggplot(subset(mink, sex == "M"), aes(x = vol3.cpc)) +
geom_histogram(fill = "firebrick", col = "black", closed = "left",
boundary = 0, binwidth = 0.02) +
labs(x = "Cube root brain volume (mm)", y = "Frequency of males") +
theme_classic() +
theme(aspect.ratio = 0.4) +
facet_wrap( ~ category, ncol = 1, scales = "free_y")
Accomplishing the same in base R is more strenuous.
# Not run
par(mfrow=c(3,1), mar = c(4, 4, 2, 1), cex = 0.7)
for( i in unique(mink$category) ){
dat <- subset(mink, category == i & sex == "M")
hist(dat$vol3.cpc, breaks = seq(3.3, 3.7, by = 0.02), right = FALSE,
xlim = range(mink$vol3.cpc, na.rm = TRUE),
col="firebrick", main = i,
xlab = "", ylab = "Frequency")
}
mtext("Cube root brain volume (mm)", side = 1, padj = 3)
Compare the distribution of the variable x
to the normal
distribution.
qqnorm(mydata$x)
qqline(mydata$x) # adds line through first and third quartiles
The same can be accomplished in ggplot
as follows.
ggplot(mydata, aes(sample = x)) +
geom_qq() +
geom_qq_line() +
theme_classic()
A strip chart (a.k.a dot plot) is like a scatter plot but the horizontal (explanatory) variable is categorical.
The basic command is as follows. A
is the name of the
categorical grouping variable and y
is the name of the
numeric response variable. A small amount of horizontal “jitter” is
added to reduce overlap of points.
# base R
stripchart(y ~ A, vertical = TRUE, data = mydata,
method = "jitter")
# ggplot
ggplot(mydata, aes(A, y)) +
geom_jitter()
For example, I’m using the brain size in mink data set used earlier.
The option jitter
controls the amount of horizontal
“jitter”.
# brain volume of females by category of mink
stripchart(vol3.cpc ~ category, vertical = TRUE,
data = subset(mink, sex == "F"),
method = "jitter", jitter = 0.1, pch = 16, col = "goldenrod1",
las = 1, xlab = "Category (females)", ylab = "Cube root brain volume (mm)")
A similar plot using ggplot()
is below.
width
controls the amount of horizontal “jitter”.
# Not run
ggplot(subset(mink, sex == "F"), aes(category, vol3.cpc)) +
geom_jitter(color = "goldenrod1", size = 3, width = 0.1, alpha = 0.5) +
labs(x = "Category (females)", y = "Cube root brain volume (mm)") +
theme_classic() +
theme(aspect.ratio = 0.80, text = element_text(size = 12),
axis.text = element_text(size = 10))
Use separate stat_summary()
commands to overlay means
and standard error bars. Use fun.data = mean_cl_normal
to
get 95% confidence intervals instead of standard error bars. Shift the
position of the error bars in the x
direction using
position_nudge()
to eliminate overlap with the raw
data.
ggplot(subset(mink, sex == "F"), aes(category, vol3.cpc)) +
geom_jitter(color = "goldenrod1", size = 3, width = 0.2, alpha = 0.5) +
stat_summary(fun.data = mean_se, geom = "errorbar",
width = 0.1, position = position_nudge(x = 0.3)) +
stat_summary(fun = mean, geom = "point",
size = 2, position=position_nudge(x = 0.3)) +
labs(x = "Category (females)", y = "Cube root brain volume (mm)") +
theme_classic() +
theme(aspect.ratio = 0.80, text = element_text(size = 12),
axis.text = element_text(size = 10))
In base R you can also add error bars stripchart()
by
using integer numbers to indicate position of categories along the
x-axis. For example, to add means and standard errors to a strip chart
for a numeric variable y
and a categorical variable
A
having four categories, use
stripchart(y ~ A, data = mydata, vertical = TRUE, method = "jitter", pch = 16)
m <- tapply(mydata$y, mydata$A, mean, na.rm = TRUE)
se <- tapply(mydata$y, mydata$A,
function(y){ sd(y, na.rm=TRUE)/sqrt(length(na.omit(y))) })
points( m ~ c(1:4 + 0.2) + 0.2, pch=16, col = "firebrick")
segments(x0 = c(1:4 + 0.2), y0 = m - se,
x1 = c(1:4 + 0.2), y1 = m + se, col = "firebrick")
A grouped strip chart displays a numeric response variable
y
for different levels of two categorical variables. This
can be accomplished in base R by overlaying multiple strip charts, but
it is also possible in ggplot()
using
position_jitterdodge
to reduce overlap among points.
ggplot(data = mink, aes(y = vol3.cpc, x = category, fill = sex, color = sex)) +
geom_point(size = 3, position = position_jitterdodge(jitter.width = 0.2),
dodge.width = 0.4, alpha = 0.5) +
labs(x = "Category", y = "Cube root brain volume (mm)") +
scale_color_manual(values=c("goldenrod1", "firebrick")) +
theme_classic() +
theme(aspect.ratio = 0.80, text = element_text(size = 16),
axis.text = element_text(size = 14))
If data are paired (two measurements of the same individual, two attributes of the same population, then paired points should be connected with a line to indicate this connection.
The following commands create a strip chart with lines connection the
two measurements of the same unit in the two treatments. The data frame
mydata
includes the response variable y
, the
treatment variable A
, and an id
variable
indicating identity of individuals measured under both treatments.
The following method works in base R.
interaction.plot(response = mydata$y, x.factor = mydata$A, trace.factor = mydata$id,
legend = FALSE, lty = 1, xlab = "Treatment", ylab = "Y variable",
type = "b", pch = 16, las = 1)
For example, the following are data on immunocompetence of red-winged
blackbirds before and after testosterone implants (bbird
data set). The treatment variable is converted to a factor to order the
categories.
bbird$treatment <- factor(bbird$treatment, levels = c("before","after"))
head(bbird)
## blackbird competence logCompetence treatment
## 1 1 105 4.65 before
## 2 2 50 3.91 before
## 3 3 136 4.91 before
## 4 4 90 4.50 before
## 5 5 122 4.80 before
## 6 6 132 4.88 before
interaction.plot(response = bbird$competence,
x.factor = bbird$treatment, trace.factor = bbird$blackbird,
legend = FALSE, lty = 1, xlab = "Treatment", ylab = "Immunocompetence",
type = "b", pch = 16, las = 1, col = "firebrick")
The following does the same with ggplot()
.
alpha
makes the points partly transparent so that
overlapping points can be partly resolved.
ggplot(bbird, aes(y = competence, x = treatment)) +
geom_line(aes(group = blackbird)) +
geom_point(size = 3, col = "firebrick", alpha = 0.5) +
labs(x = "Treatment", y = "Immunocompetence") +
theme_classic() +
theme(text = element_text(size = 14))
The easiest ggplot
method for comparing paired data
among groups is to make a separate paired-data plot for each group using
facet_wrap()
and tweak the theme()
options to
make it look like a single, integrated plot.
The following example uses the stickle
data set,
containing population means of males and females from threespine
stickeleback populations, to show that limnetic and benthic ecotypes
repeatedly have opposite sexual size dimorphism.
head(stickle)
## Lake Population Ecotype Sex length raker.number
## 1 Emily Emily-B benthic female 54.85 18.92
## 2 Enos Enos-B benthic female 51.26 18.67
## 3 Hadley Hadley-B benthic female 56.71 19.42
## 4 Paxton Paxton-B benthic female 52.07 17.79
## 5 Priest Priest-B benthic female 55.16 18.71
## 6 Emily Emily-L limnetic female 44.44 23.63
ggplot(data = stickle, aes(y = length, x = Sex, fill = Sex, color = Ecotype)) +
geom_line(aes(group = Population), col = "black") +
geom_point(aes(shape = Sex), size = 2.5,
position = position_dodge(width = 0.2), show.legend = TRUE) +
facet_wrap(~ Ecotype, nrow = 1, strip.position = "top") +
labs(x = "Sex", y = "Body length (mm)") +
scale_color_manual(values=c("indianred4", "lightcoral", "darkcyan")) +
scale_shape_manual(values = c(16, 17)) +
theme_classic() +
theme(strip.background = element_blank(), strip.text.x = element_blank())
The following code generates a box plot for the numeric variable
yvar
separately for every group identified in the
categorical variable A
.
The following shows use of the formula method in
barplot()
, written as “response_variable ~
explanatory_variable”. If varwidth = TRUE
, box width is
proportional to the square root of sample size — set to
FALSE
if you want all boxes to have the same width.
boxplot(yvar ~ A, data = mydata, varwidth = TRUE,
ylab="y value", col = "firebrick", cex.axis = 0.8, las = 1)
For example, here is a box plot of of brain sizes of female mink. Box width reflects sample size.
head(mink)
## id_number category date sex age_months cbl vol vol3 vol3.cpc
## 1 1 farm 2009-11-10 F 6 67.18 41.32 3.457165 3.427563
## 2 2 farm 2009-11-10 F 6 64.22 38.97 3.390342 3.441955
## 3 3 farm 2009-11-10 F 6 64.88 41.23 3.454653 3.476658
## 4 4 farm 2009-11-10 F 18 64.98 40.36 3.430181 3.455552
## 5 5 farm 2009-11-10 F 6 65.31 37.86 3.357842 3.392389
## 6 6 farm 2009-11-10 F 6 66.28 36.09 3.304677 3.330347
boxplot(vol3.cpc ~ category, data = subset(mink, sex == "F"), varwidth = TRUE,
ylab="Cube root brain volume (mm)", col = "goldenrod1", cex.axis = 0.9,
xlab = "Category (females)", las = 1)
A similar box plot can be made using ggplot()
. If
varwidth = TRUE
, box width is proportional to the square
root of sample size (set this option to FALSE
if you want
all boxes to have the same width).
For example, using the mink data set,
# Not run
ggplot(subset(mink, sex == "F"), aes(x = category, y = vol3.cpc)) +
geom_boxplot(fill = "goldenrod1", varwidth = TRUE) +
labs(x = "Category (females)", y = "Cube root brain volume (mm)") +
theme_classic() +
theme(text = element_text(size = 14))
A grouped box plot displays a numeric response variable
y
for different levels of two categorical variables,
A
and B
. This can be accomplished in base R by
overlaying multiple box plots, but it is easier to produce the plot with
ggplot()
.
ggplot(data = mydata, aes(y = y, x = A, fill = B)) +
geom_boxplot(width = 0.6, position = position_dodge(width = 0.7)) +
labs(x = "A variable", y = "y variable") +
theme_classic()
Box width is controlled by width
, whereas
position_dodge(width = )
controls the horizontal distance
between the adjacent boxes depicting different categories of the
B
variable within groups of the A
variable.
For example, here is the grouped box plot for the mink
data set on (size-corrected) brain sizes in farmed, wild, and feral
minks. Compare with the earlier strip chart of the same data.
ggplot(data = mink, aes(y = vol3.cpc, x = category,
fill = sex)) +
geom_boxplot(width = 0.6, position = position_dodge(width = 0.7)) +
labs(x = "Category", y = "Cube root brain volume (mm)") +
scale_fill_manual(values=c("goldenrod1", "firebrick")) +
theme_classic() +
theme(aspect.ratio = 0.8, text = element_text(size = 14))
A violin plot shows the frequency distribution for a numerical variable (and its mirror image) for several groups. The frequency distribution is a kernel density estimate, which “smooths” the distribution. Some find this graph more intuitive than the box plot.
The basic ggplot()
violin plot is as follows.
yvar
is a numeric variable and A
is a
categorical variable. stat_summary()
ads a dot for each
category mean.
ggplot(mydata, aes(x = A, y = yvar)) +
geom_violin(fill = "firebrick") +
stat_summary(fun.y = mean, geom = "point", color = "black",
For example, the following code generates a grouped violin plot for the mink brain size data set.
ggplot(data = mink, aes(y = vol3.cpc, x = category,
fill = sex)) +
geom_violin(position = position_dodge(width = 0.8)) +
stat_summary(fun.y = mean, geom = "point", color = "black",
position = position_dodge(width = 0.8)) +
labs(x = "Category", y = "Cube root brain volume (mm)") +
scale_fill_manual(values=c("goldenrod1", "firebrick")) +
theme_classic() +
theme(aspect.ratio = 0.8, text = element_text(size = 14))
The job can be done in base R, but you’ll need to install the
vioplot
package first using
install.packages()
, and then load it. Here’s the code
applied to the mink data for males. You’ll need to tweak the value for
h
to control the degree of smoothing.
# Not run
vioplot(vol3.cpc ~ category, data = subset(mink, sex == "M"),
col="firebrick", drawRect = FALSE, las = 1, h = 0.03,
xlab = "Category (males)", ylab = "Cube root brain volume (mm)")
Here’s how to to produce a scatter plot for two numeric variables,
x
and y
.
# formula method, base R
plot(y ~ x, data = mydata, pch = 16, col = 2)
# basic scatter plot in ggplot
ggplot(mydata, aes(x, y)) + geom_point()
For example, here is a scatter plot of brain size (vol3
,
cube root of volume) against a measure of body size (condylobasal
length, cbl
) in the mink
data set. The option
alpha
makes the dots partly transparent.
ggplot(mink, aes(x = cbl, y = vol3)) +
geom_point(size = 2, col = "firebrick", alpha = 0.5) +
labs(x = "Condylobasal length (mm)", y = "Cube root of brain volume (mm)") +
theme_classic() +
theme(aspect.ratio = 0.80)
To add a smooth curve through the data, use locally weighted
regression. “Local” here means that y
is predicted for each
x
using only data in the vicinity of that x
,
rather than all the data.
In base R you can control the size of the vicinity using the option
f
, which is a proportion between 0 (very narrow vicinity)
and 1 (uses all the data). Try different values of f
to
best capture the relationship. The default is f = 2/3
. The
method works best if the variables are ordered by x
.
plot(y ~ x, data = mydata)
x1 <- mydata$x[order(mydata$x)]
y1 <- mydata$y[order(mydata$x)]
lines(lowess(x1, y1, f = 0.5))
Using ggplot()
, you also get SE’s of predicted values
(set to FALSE if not desired).
ggplot(mink, aes(x = cbl, y = vol3)) +
geom_point(size = 2, col = "firebrick", alpha = 0.5) +
geom_smooth(method = "loess", linewidth = 1, col = "black", se = TRUE) +
labs(x = "Condylobasal length (mm)", y = "Cube root of brain volume (mm)") +
theme_classic() +
theme(aspect.ratio = 0.80)
To add the least squares regression line to a plot, use the following.
# Not run
# base R
plot(y ~ x, data = mydata)
z <- lm(y ~ x, data = mydata)
abline(z)
# in ggplot
ggplot(mydata, aes(x, y)) +
geom_point(size = 2, col = "firebrick") +
labs(x = "x variable", y = "y variable") +
geom_smooth(method = "lm", linewidth = 1, col = "black") +
theme_classic() +
theme(aspect.ratio = 0.80)
In base R you can use the cursor to click data points on the graph to
identify individuals. The following code prints the row number when you
click a point (change that by setting the labels
option to
a character variable that labels the point instead).
plot(y ~ x, data = mydata)
identify(mydata$x, mydata$y, labels = seq_along(mydata$x))
A single scatter plot of y
on x
can show
multiple groups if you vary the color and shape of dots.
In base R, use pch
to vary the symbol, col
to vary the color, and both to vary color and shape at the same time.
The legend
command adds a legend identifying the
groups—click on the plot (inside the plot region) with your cursor to
place the legend.
# Not run
plot(y ~ x, data = mydata, pch = as.numeric(factor(mydata$A)),
col = as.numeric(factor(mydata$A)))
legend( locator(1), legend = as.character(levels(factor(mydata$A))),
pch = 1:length(levels(factor(mydata$A))),
col=1:length(levels(factor(mydata$A))) )
To accomplish the same with ggplot()
, specify which
variable to vary by color and by shape within aes()
mapping. For example in the mink data we can use color and shape to
differentiate the three categories. The geom_smooth()
command is used to fit each group with a regression line.
ggplot(subset(mink, sex == "M"),
aes(x = cbl, y = vol3, colour = category, shape = category)) +
geom_point(size = 2, alpha = 0.7) +
geom_smooth(method = "lm", linewidth = 1, se = FALSE) +
labs(x = "Condylobasal length (mm)",
y = "Cube root of brain volume (mm)") +
ggtitle("Mink brain and body size - males") +
theme_classic() +
theme(aspect.ratio = 0.80, plot.title = element_text(hjust = 0.5))
To constrain regression lines for all the groups to have the same slopes (but different intercepts), some pre-calculations are required.
femaledata <- subset(mink, sex == "F")
fitmink <- lm(log(vol3) ~ log(cbl) + category, data = femaledata,
na.action = na.exclude)
femaledata$predict <- predict(fitmink)
ggplot(femaledata, aes(x = log(cbl), y = log(vol3),
colour = category, shape = category)) +
geom_point(size = 2, alpha = 0.8) +
geom_line(aes(y = predict)) +
labs(x = "Condylobasal length (mm)",
y = "Cube root of brain volume (mm)") +
theme_classic() +
theme(aspect.ratio = 0.80)
Another way to make a scatter plot for multiple groups is to plot
them separately in panes of a panel of plots. Use
facet_wrap()
with ggplot()
for this purpose,
as follows. The lattice
package can also be used to make
panels of plots, as described briefly at the bottom of this page.
ggplot(subset(mink, sex == "F"),
aes(x = cbl, y = vol3, colour = category, shape = category)) +
geom_point(size = 2, alpha = 0.7) +
geom_smooth(method = "lm", linewidth = 1, se = FALSE) +
facet_wrap(~ category, ncol = 2) +
labs(x = "Condylobasal length (mm)",
y = "Cube root of brain volume (mm)") +
theme_classic() +
theme(aspect.ratio = 0.80)
The following command creates a single graph with all pairwise
scatter plots for variables x1
to x5
in the
data frame, mydata
. The option gap
adjusts the
spacing between separate plots,
pairs(~ x1 + x2 + x3 +x4 + x5, data = mydata)
Here’s how to plot a sequence of (x, y
) points and
connect the dots with lines. This is especially useful when the
x
-variable represents a series of points in time or across
a spatial gradient.
In base R:
plot(y ~ x, data = mydata, pch=16)
plot(y[order(x)] ~ x[order(x)], data = mydata, type = "l",
lty = 3, lwd = 2, col = "red")
For example, the following plots the total area burned in BC forest fires from 1950 to 2021.
# First calculate the cumulative area of all fires of at least 1 ha each year
z <- aggregate(SIZE_HA ~ YEAR, data = BCfires1, FUN = sum, na.rm = TRUE)
# Line plot of log area
plot(log10(SIZE_HA) ~ YEAR, data = z, type = "l", las = 1, cex.axis = 0.7,
xlab = "Year", ylab = "log10 total area burned (ha)")
# Optionall addition of points
points(log10(SIZE_HA) ~ YEAR, data = z, add = TRUE, pch = 16, cex = 0.5)
To do the same in ggplot
, use the following. I added
geom_smooth()
to show the trend.
# First calculate the cumulative area of all fires of at least 1 ha each year
z <- aggregate(SIZE_HA ~ YEAR, data = BCfires1, FUN = sum, na.rm = TRUE)
# Line plot
ggplot(z, aes(x = YEAR, y = log10(SIZE_HA))) +
geom_line() +
geom_point(size = 0.5) +
labs(x = "Year", y = "log10 total area burned") +
geom_smooth(method = "loess", linewidth = 1, col = "firebrick", se = FALSE) +
theme_classic() +
theme(aspect.ratio = 0.80, text = element_text(size = 16),
axis.text = element_text(size = 14))
The command interaction.plot
is quick but rudimentary,
as it fails to show the data.
Interaction plots display how the mean of a numeric response variable
y
changes between the levels of two categorical variables,
A
and B
. The graph is especially useful for
determine whether an interaction is present between two factors
A
and B
in a factorial experiment, or between
a factor A
and a blocking variable B
. If the
lines are parallel then there is no interaction.
z <- aggregate(y ~ A + B, data = mydata, FUN = mean, na.rm = TRUE)
interaction.plot(z$A, z$B, response = z$y, type = "b")
For example, the mink data show roughly parallel lines between the sexes, indicating little interaction between sex and category.
# First calculate the means for each combination of the categorical variables
z <- aggregate(vol3.cpc ~ category + sex, data = mink, FUN = mean, na.rm = TRUE)
# Then draw the plot
interaction.plot(x.factor = z$category, trace.factor = z$sex,
response = z$vol3.cpc, type = "b", lty = 1, las = 1,
legend = TRUE, pch = 16, col = c("goldenrod1", "firebrick"))
lattice
packageThe lattice
package makes it easy to draw a panel of
plots separately by group. The basic plot is simple, but commands to add
points and lines to the individual panes can be tricky.
The lattice
package is included with the basic
installation but you need to load the library. The graph types available
in the lattice package include the standard ones found also in R’s basic
graphics package, such as box plots, histograms, and so on. The table
below lists the most types and the relevant command.
For example, to draw a histogram of a numeric variable x
separately for four groups identified by the variable B
in
the data frame mydata
, use
library(lattice)
histogram(~ x | B, data = mydata, layout = c(1,4), right = FALSE)
The layout
option is special to lattice and draws the 4
panels in a grid with 1 column with 4 rows, so that the histograms are
stacked and most easily compared visually. The right=FALSE
option has the same meaning here as for the base R hist
command.
To draw a bar graph showing the frequency distribution of a
categorical variable A
separately for each group identified
by the variable B
,
barchart( ~ table(A) | B, data = mydata)
This produces horizontal bar graphs, which leaves room for the category labels. To draw the bars vertically instead, while tilting the group labels on the x-axis by 45 degrees so that they fit,
barchart(table(A) ~ names(table(A)) | B, data = mydata,
scales = list(x=list(rot=45)))
As a third example, draw a scatter plot to show the relationship
between the numeric variables x
and x
separately for each group in the variable B
. The
pch
option in this example replaces the default plot symbol
with a filled dot, and the aspect
option sets the relative
lengths of the vertical and horizontal axes.
xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7)
It is possible to add plot elements to individual panels, but the
commands and options take some getting used to. For example, to fit a
separate regression line to each scatter plot, one to each group, use
the panel
argument in xyplot
to construct a
function that applies built-in panel functions to each group, as
follows.
xyplot(y ~ x | B, data = mydata, pch = 16, aspect = 0.7,
panel=function(x, y){ # Use x and y here, not real variable names
panel.xyplot(x, y) # draws the scatter plot
panel.lmline(x, y) # fits the regression line
}
)
This doesn’t even begin to describe what’s possible using the lattice package. Crawley has a few more examples of trellis graphics in The R Book. Sarkar (2008) gives a complete description. See the links to these books on the “Books” tab of the Biology 501 course page.
The table below shows a few of the commonly used plotting commands in
the lattice package. x
and y
are numeric
variables, whereas A
is a categorical variable (character
or factor). B
is a factor or character variable that will
define the groups or subsets of the data frame to be plotted in separate
panels. A separate plot in the graphics window will be made for each of
the groups defined by the variable B
.
Command | Graph type | Syntax (~ is the “tilde” symbol) |
barchart
|
bar graph |
barchart(~table(A) | B, data=mydata)
|
bwplot
|
box plot |
bwplot(x ~ A | B, data=mydata)
|
histogram
|
histogram |
histogram(~x | B, data=mydata, right=FALSE)
|
stripplot
|
strip chart |
stripplot(x ~ A | B, data=mydata, jitter=TRUE)
|
xyplot
|
scatter plot |
xyplot(y ~ x | B, data=mydata)
|
© 2009-2024 Dolph Schluter