The ape
package in R contains tools for phylogenetic
comparative methods. For more details on how to use the package see the
online book by E. Paradis, Analysis of phylogenetics and evolution
with R. A link to this book can be found on the “Books” tab of the
course main web pages. The phytools
package also includes
many useful commands.
Load the package to begin. You will also need to load the
nlme
and visreg
packages for some of the
functions on this page.
library(ape)
library(phytools)
library(nlme)
library(visreg)
This section explains commands to read and write files containing phylogenetic trees, and to display trees in the plot window in R.
Phylogenetic trees are usually represented in one of two text formats, newick and nexus. The formats are explained below. To read a tree contained in a text file, use the corresponding command.
mytree <- read.tree("/directoryname/myfile.tre") # newick format
mytree <- read.nexus("/directoryname/myfile.nex") # nexus format
The resulting object mytree
is a special type of list.
The following commands access some of the information stored in
mytree
.
mytree # prints basic tree information
mytree$tip.label # species names
mytree$edge.length # lengths of all tree branches
mytree$edge # identity of branches (from node, to node)
Use the plot command to view a phylogeny stored in the object
mytree
. Include the cex
option to reduce
overlap of species labels.
plot(mytree, cex=0.7)
write.tree("/directoryname/myfile.tre") # newick format
write.nexus("/directoryname/myfile.nex") # nexus format
The newick format is used by most programs that estimate phylogenies. Trees in this format look like the following (you might need to scroll right within the box to see the entire line):
((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);
Each species at the tips of the phylogeny has a name followed by a number indicating the length of the branch connecting it to its immediate ancestor. Sister taxa – those sharing an immediate common ancestor – are wrapped by parentheses. In the above example, the innermost parentheses indicate that Homo and <i”>Pongo are sister taxa. The next set of parentheses outward indicate that Macaca is the sister to the Homo/Pongo group. And so on. </i”>
The nexus format is also widely used. The same tree shown above in newick format looks like the following in nexus format:
BEGIN TAXA;
DIMENSIONS NTAX = 5;
TAXLABELS
Homo
Pongo
Macaca
Ateles
Galago
;
END;
BEGIN TREES;
TRANSLATE
1 Homo,
2 Pongo,
3 Macaca,
4 Ateles,
5 Galago
;
TREE * UNTITLED = [&R] ((((1:0.21,2:0.21):0.28,3:0.49):0.13,4:0.62):0.38,5:1);
END;
This tree can be read from the file and plotted as follows:
apetree <- read.tree(url("https://www.zoology.ubc.ca/~schluter/R/csv/GreatApes.tre"))
plot(apetree)
Trait data read from a text file should be stored in a data frame. Use the familiar commands in R to read the data, e.g.,
mydata <- read.csv("/directoryname/myfile.csv")
Make sure the species names are included in the data frame. They
should match to the letter (and case) the species names in the
mytree
object that stores the tree information (the species
names are in mytree$tip.label). So be sure to check carefully.
Some functions require that the row names of mydata
be
the species names. If the species names are contained in the variable
species
in mydata
, set the rownames as
follows
rownames(mydata) <- mydata$species
The species data in the data frame containing the trait measurements
must be in the same row order as the names of the species in the
tree object, mytree
(species names are in
mytree$tip.label
). If this is not the case then you need to
rearrange the rows of the data frame to match the order of species names
in the tree object. The following command will do this for you; it
assumes that the rownames of mydata
are the species
names:
mydata <- mydata[match(mytree$tip.label,rownames(mydata)),]
Values for a continuously-varying trait can be visualized at the tips
of the tree using dots of varying size. A larger dot indicates a larger
value for the trait. To begin, create a new data frame that contains
x
, y
, and z
).
mydat2 <- mydata[, c("x", "y", "z")]
dotTree(mytree, as.matrix(mydat2)[,c("x")]) # plot trait "x" at tree tips
Phylogenetic signal is the tendency for closely related species to be
more similar than distantly related species. Phylogenetic signal in a
continuously varying trait can be quantified using Pagel’s λ
(“lambda”). This quantity ranges between 0 and 1, with 1 representing
maximum phylogenetic signal. To begin, create a new data frame that
contains x
, y
, and z
).
mydat2 <- mydata[, c("x", "y", "z")]
phylosig(mytree, as.matrix(mydat2)[,c("x")]) # phylogenetic signal in trait "x"
This is a method that estimates and tests a regression or correlation between two variables while correcting for the nonindependence of data points resulting from phylogeny. It assumes that trait evolution mimics a Brownian motion process with unchanging rates through time and along all branches of the tree. This is a difficult assumption to verify.
Before beginning, make sure that the rows of the species data in
mydata
are in the same order as the names of the species in
the tree object, mytree
, namely
mytree$tip.label
. If not, see the previous section to
rearrange.
Use the pic() function to transform the two variables x
and y
separately into phylogenetically independent
contrasts (PICs), x1
and y1
. Then fit a simple
linear model using these contrasts but leave out the intercept
term so that the fitted line passes through the origin. The
regression is through the origin because each contrast is calculated as
a difference (e.g., species A minus species B) and the direction of the
difference is arbitrary (it could just as well have been calculated as
species B minus species A).
x1 <- pic(mydata$x, mytree)
y1 <- pic(mydata$y, mytree)
View x1
and y1
before entering the next
command. Check especially for values listed as “NaN” or “Inf”. They may
result if some of the tip branches in the tree have length 0 (the branch
lengths are given in mytree$edge.length
). See below for
suggestions on what to do in this case.
If all is well, then fit the linear model through the origin. All the
usual commands can be used to pull results from the fitted
lm
object, z
.
z <- lm(y1 ~ x1 - 1) # the "-1" forces line through origin
Often the goal is to estimate a correlation coefficient rather than a
regression coefficient. The method for calculating a correlation between
independent contrasts differs from the usual calculation for a
correlation because we need to account for the fact that the direction
of the contrast is arbitrary. Both the following calculations should
give the same answer. First calculate x1
, y1
,
and z
as in the PICs section above. Then calculate the
correlation between independent contrasts as
r <- sum(x1*y1)/(sqrt(sum(x1^2))*sqrt(sum(y1^2))) # or
r <- sqrt(summary(z)$r.squared)
The PIC method in ape
will crash if certain of the
branches of the tree have length zero. If this happens to you, inspect
the branch lengths.
mytree$edge.length # lengths of all tree branches
range(mytree$edge.length) # shortest and longest branch
Try adding a very small number to each of the branches to prevent crashes caused by zero branch lengths. Here I add a constant “0.001” to each branch.
mytree$edge.length <- mytree$edge.length + 0.001
Retry the PICs method to see if this fixes the problem.
General least squares (GLS) is a linear model technique that allows non-independent data points to be fitted when the expected similarity (“correlation”) between data points is known. In the present case, the phylogeny provides the expected correlations, assuming that trait evolution conforms to a Brownian motion or other process. The method gives identical results to PICs with two variables undergoing Brownian motion.
The advantage of GLS instead of PICs is that it provides a more flexible framework for analyzing trait evolution. Any linear model can be corrected for phylogeny. It also allows the use of other evolutionary models, not just Brownian motion. Diagnostics (e.g., residual and scatter plots based on the phylogenetically transformed data) are harder to obtain but we solve this problem below.
GLS uses a matrix containing the expected correlation between trait
values of all pairs of species. The expected correlation is proportional
to the amount of evolutionary history, from root to tips, that they
share through common descent. Basically, the phylogenetic correlation
matrix re-weights the data from each species in a linear model. The
vcv()
function calculates the matrix corresponding to a
given phylogenetic tree under the Brownian motion model of
evolution.
For example, the phylogenetic correlation matrix for the tree of great apes read earlier is as follows (assuming Brownian motion). Other evolutionary models are described below.
v <- vcv(apetree, corr = TRUE)
v
## Homo Pongo Macaca Ateles Galago
## Homo 1.00 0.79 0.51 0.38 0
## Pongo 0.79 1.00 0.51 0.38 0
## Macaca 0.51 0.51 1.00 0.38 0
## Ateles 0.38 0.38 0.38 1.00 0
## Galago 0.00 0.00 0.00 0.00 1
The gls()
command is from the nlme
library.
The variables x
and y
are assumed to be in the
data frame mytree
. We are going back to the data here and
are not using the PICs calculated earlier. Since we are modeling the
data directly rather than the contrasts, we don’t remove the intercept
term from the model formula.
z <- gls(y ~ x, data=mydata, correlation=corBrownian(1, mytree))
The results are stored in the gls()
model object
z
. This object behaves similarly to a lm()
model object from ordinary linear model fitting, and a number of the
same commands are used to pull out results.
summary(z) # coefficients, SE's, fit statistics
confint(z) # confidence intervals for coefficients
The GLS method essentially transforms the variables in your linear model to a new scale where all the usual assumptions of linear models – independent residuals having equal variance — are met (assuming that your model of evolution is the correct one). GLS then fits an ordinary linear model to these transformed variables.
You can evaluate linear model assumptions by making scatter plots and
residual plots of these transformed variables from a GLS analysis using
the lm.gls()
function at the end of this page. Cut and
paste the code for the function into your R command console. You’ll need
to load the visreg
package too.
I illustrate using data from Rolland et al (2020) “Vulnerability to fishing and life history traits correlate with the load of deleterious mutations in teleosts”, Molecular Biology and Evolution 37: 2192–2196. The linear model will fit an estimate of deleterious mutation accumulation in fish species to a measure of fish species vulnerability in the face of human exploitation.
# read the tree
fishtree <- read.tree(url("https://www.zoology.ubc.ca/~schluter/R/csv/fishtree.tre"))
fishtree
##
## Phylogenetic tree with 65 tips and 64 internal nodes.
##
## Tip labels:
## Astyanax_mexicanus, Danio_rerio, Gasterosteus_aculeatus, Myoxocephalus_scorpius, Sebastes_norvegicus, Chaenocephalus_aceratus, ...
##
## Rooted; includes branch lengths.
# read the data
fishdat <- read.csv(url("https://www.zoology.ubc.ca/~schluter/R/csv/fishdat.csv"),
row.names = 1)
head(fishdat)
## dNdS ss.Vulnerability
## Anabas_testudineus 0.08792156 12.47
## Antennarius_striatus 0.07221790 13.16
## Arctogadus_glacialis 0.10974063 31.13
## Astyanax_mexicanus 0.05488084 10.00
## Bathygadus_melanobranchus 0.10815480 54.74
## Benthosema_glaciale 0.07361320 27.28
# Calculate the phylogenetic correlation matrix
v <- vcv(fishtree, model = "Brownian", corr = TRUE)
# Fit the GLS model using the lm.gls() function
z <- lm.gls(x = fishdat$ss.Vulnerability, y = fishdat$dNdS, v = v)
# The output will include the fitted model object "lm.fit" and a data frame of results
names(z)
## [1] "lm.fit" "result"
Use summary()
to extract coefficients from the fitted
model object and anova()
to test factors.
summary(z$lm.fit)
anova(z$lm.fit)
In results data frame, x
is the transformed predictor
variable of interest and ynew
is the transformed response
variable. yhat
are the predicted values and
resid
are the residuals.
head(z$result)
## intercept x ynew yhat resid
## 1 0.8287927 10.15330 0.077005090 0.05926235 0.017742739
## 2 0.8287927 11.08866 0.055717328 0.05991080 -0.004193474
## 3 0.1766355 2.66602 0.064105281 0.01297831 0.051126967
## 4 0.1766355 -39.79760 -0.046142947 -0.01646017 -0.029682779
## 5 0.1785635 36.81465 0.053106918 0.03677380 0.016333115
## 6 0.1785635 -11.79493 -0.008038477 0.00307455 -0.011113028
Use visreg
to show a scatter plot of the transformed
variables.
visreg(z$lm.fit, "x", ylab = "Corrected dN/dS", xlab = "Corrected Vulnerability")
A residual plot is obtained by comparing resid
with
yhat
.
plot(z$result$resid ~ z$result$yhat)
Under the strict Brownian motion model, the expected correlation ρij between the trait values of any two species i and j is measured by the fraction of total phylogenetic history, from root to tip, that they share. An alternative possibility is that every correlation is weaker by the same amount, λ*ρij, where λ is a constant between 0 and 1. A λ value less than 1 implies that the effect of phylogeny is not as strong as predicted by Brownian motion, as though each species has experienced an extra bit of independent evolution since it split from its ancestor. If λ = 0 then there is no phylogenetic signal at all. The approach multiplies all the off-diagonal values of the phylogenetic correlation matrix by λ and then fits a GLS model using this modified correlation matrix.
The method is simple to implement in ape
. The three
commands below fit models to the data in which λ is 1, 0, and
0.5.
z <- gls(y ~ x, data=mydata, correlation=corPagel(1, mytree, fixed = TRUE))
z <- gls(y ~ x, data=mydata, correlation=corPagel(0, mytree, fixed = TRUE))
z <- gls(y ~ x, data=mydata, correlation=corPagel(0.5, mytree, fixed = TRUE))
The following command finds and fits the maximum likelihood value for λ, given the data. The best-fit value of λ is a useful guide to the extent to which species differences are predicted by phylogeny.
z <- gls(y ~ x, data=mydata, correlation=corPagel(1, mytree, fixed = FALSE))
For diagnostic plots (scatter and residual plots) use the
lm.gls()
as above but use a modified phylogenetic
correlation matrix.
v <- vcv(corPagel(1, fishtree, fixed=FALSE)) # or use Fixed = TRUE and supply a lambda value
Under Brownian motion, the expected or average difference between species is proportional to the amount of time since they split from a common ancestor. This assumes that species can evolve ever greater differences without constraint. However, real evolution seems to have bounds, even if they are soft. The Ornstein–Uhlenbeck (OU) process attempts to capture this idea of limits or constraints in a modified random walk that includes an “attractor” or “optimum.” (The process is sometimes said to represent a model of “stabilizing selection”, but this is confusing because the attractor isn’t an optimum in the population genetic sense.) Under an OU process, the farther a species trait evolves away from the attractor, the stronger the tendency for the next step in its random walk to be toward the attractor rather than away from it.
Use gls()
to fit an OU model to the data as follows.
z <- gls(y ~ x, data=mydata, correlation=corMartins(1, mytree, fixed=TRUE)) # λ = 1
z <- gls(y ~ x, data=mydata, correlation=corMartins(1, mytree, fixed=FALSE)) # λ is estimated
For diagnostic plots (scatter and residual plots) use the
lm.gls()
as above but use a modified phylogenetic
correlation matrix for the OU process.
v <- vcv(corMartins(1, fishtree, fixed=TRUE))
To use this function, copy and paste the code into your console.
lm.gls <- function(x, y, v){
# Function to implement general least squares
# Assumes no missing values
# x is a univariate vector or a matrix of variables
# y is a vector, the response variable
# v is the phylogenetic covariance matrix
# There is a transformation of the original x and y variables
# that would fulfill the assumptions of uncorrelated residuals having
# equal variances (see Draper and Smith p 108 for this logic in the
# case of weighted least squares for uncorrelated data; p156 for
# the case with serially correlated data). The transformation (P)
# is obtained by solving P'P=V as follows:
# Generate the design matrix X, including column of 1's for intercept
z <- lm(y ~ x)
X <- model.matrix(z)
colnames(X)[1] <- "intercept"
# Calculate pv, the "square root" of v
ve <- eigen(v)
pv <- ve$vectors %*% sqrt(diag(ve$values)) %*% t(ve$vectors)
# Transform y and X by pv
X <- as.data.frame(solve(pv) %*% X)
ynew <- solve(pv) %*% y
# Fit the gls model, corrected for phylogeny
f <- as.formula(paste("ynew ~", paste(names(X), collapse = " + "), "-1"))
lm.fit <- lm(f, data = X)
# Collect results
yhat <- predict(lm.fit)
resid <- residuals(lm.fit)
result <- data.frame(X, ynew, yhat, resid)
return(list(lm.fit = lm.fit, result = result))
}
© 2009-2024 Dolph Schluter