Phylogenetic comparative methods

Phylogenetic trees

Read tree from file
Plot tree
Write trees to file
Tree formats

Trait data

Species names
Row names
Row order
Plot continuous traits on trees
Phylogenetic signal

Independent contrasts

PICs
Correlation
Zero branch lengths

General least squares method

Phylogenetic correlation matrix
Using GLS
Diagnostic plots for GLS
Alternative evolutionary models
The Ornstein–Uhlenbeck process
gls.fit() function

The ape package in R contains tools for phylogenetic comparative methods. For more details on how to use the package see the online book by E. Paradis, Analysis of phylogenetics and evolution with R. A link to this book can be found on the “Books” tab of the course main web pages. The phytools package also includes many useful commands.

Load the package to begin. You will also need to load the nlme and visreg packages for some of the functions on this page.

library(ape)
library(phytools)
library(nlme)
library(visreg)

Phylogenetic trees

This section explains commands to read and write files containing phylogenetic trees, and to display trees in the plot window in R.

Read tree from file

Phylogenetic trees are usually represented in one of two text formats, newick and nexus. The formats are explained below. To read a tree contained in a text file, use the corresponding command.

mytree <- read.tree("/directoryname/myfile.tre")   # newick format
mytree <- read.nexus("/directoryname/myfile.nex")  # nexus format

The resulting object mytree is a special type of list. The following commands access some of the information stored in mytree.

mytree               # prints basic tree information
mytree$tip.label     # species names
mytree$edge.length   # lengths of all tree branches
mytree$edge          # identity of branches (from node, to node)

Plot tree

Use the plot command to view a phylogeny stored in the object mytree. Include the cex option to reduce overlap of species labels.

plot(mytree, cex=0.7)

Write trees to file

write.tree("/directoryname/myfile.tre")   # newick format
write.nexus("/directoryname/myfile.nex")  # nexus format

Tree formats

The newick format is used by most programs that estimate phylogenies. Trees in this format look like the following (you might need to scroll right within the box to see the entire line):

((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);

Each species at the tips of the phylogeny has a name followed by a number indicating the length of the branch connecting it to its immediate ancestor. Sister taxa – those sharing an immediate common ancestor – are wrapped by parentheses. In the above example, the innermost parentheses indicate that Homo and <i”>Pongo are sister taxa. The next set of parentheses outward indicate that Macaca is the sister to the Homo/Pongo group. And so on. </i”>

The nexus format is also widely used. The same tree shown above in newick format looks like the following in nexus format:

BEGIN TAXA;
DIMENSIONS NTAX = 5;
TAXLABELS
Homo
Pongo
Macaca
Ateles
Galago
;
END;
BEGIN TREES;
TRANSLATE
1       Homo,
2       Pongo,
3       Macaca,
4       Ateles,
5       Galago
;
TREE * UNTITLED = [&R] ((((1:0.21,2:0.21):0.28,3:0.49):0.13,4:0.62):0.38,5:1);
END;

This tree can be read from the file and plotted as follows:

apetree <- read.tree(url("https://www.zoology.ubc.ca/~schluter/R/csv/GreatApes.tre"))
plot(apetree)

Trait data

Trait data read from a text file should be stored in a data frame. Use the familiar commands in R to read the data, e.g.,

mydata <- read.csv("/directoryname/myfile.csv")

Species names

Make sure the species names are included in the data frame. They should match to the letter (and case) the species names in the mytree object that stores the tree information (the species names are in mytree$tip.label). So be sure to check carefully.

Row names

Some functions require that the row names of mydata be the species names. If the species names are contained in the variable species in mydata, set the rownames as follows

rownames(mydata) <- mydata$species

Row order

The species data in the data frame containing the trait measurements must be in the same row order as the names of the species in the tree object, mytree (species names are in mytree$tip.label). If this is not the case then you need to rearrange the rows of the data frame to match the order of species names in the tree object. The following command will do this for you; it assumes that the rownames of mydata are the species names:

mydata <- mydata[match(mytree$tip.label,rownames(mydata)),]

Plot continuous traits on trees

Values for a continuously-varying trait can be visualized at the tips of the tree using dots of varying size. A larger dot indicates a larger value for the trait. To begin, create a new data frame that contains only the numeric variables of interest (here, named x, y, and z).

mydat2 <- mydata[, c("x", "y", "z")]
dotTree(mytree, as.matrix(mydat2)[,c("x")]) # plot trait "x" at tree tips

Phylogenetic signal

Phylogenetic signal is the tendency for closely related species to be more similar than distantly related species. Phylogenetic signal in a continuously varying trait can be quantified using Pagel’s λ (“lambda”). This quantity ranges between 0 and 1, with 1 representing maximum phylogenetic signal. To begin, create a new data frame that contains only the numeric variables of interest (here, named x, y, and z).

mydat2 <- mydata[, c("x", "y", "z")]
phylosig(mytree, as.matrix(mydat2)[,c("x")]) # phylogenetic signal in trait "x"

Independent contrasts

This is a method that estimates and tests a regression or correlation between two variables while correcting for the nonindependence of data points resulting from phylogeny. It assumes that trait evolution mimics a Brownian motion process with unchanging rates through time and along all branches of the tree. This is a difficult assumption to verify.

Before beginning, make sure that the rows of the species data in mydata are in the same order as the names of the species in the tree object, mytree, namely mytree$tip.label. If not, see the previous section to rearrange.

PICs

Use the pic() function to transform the two variables x and y separately into phylogenetically independent contrasts (PICs), x1 and y1. Then fit a simple linear model using these contrasts but leave out the intercept term so that the fitted line passes through the origin. The regression is through the origin because each contrast is calculated as a difference (e.g., species A minus species B) and the direction of the difference is arbitrary (it could just as well have been calculated as species B minus species A).

x1 <- pic(mydata$x, mytree)
y1 <- pic(mydata$y, mytree)

View x1 and y1 before entering the next command. Check especially for values listed as “NaN” or “Inf”. They may result if some of the tip branches in the tree have length 0 (the branch lengths are given in mytree$edge.length). See below for suggestions on what to do in this case.

If all is well, then fit the linear model through the origin. All the usual commands can be used to pull results from the fitted lm object, z.

z <- lm(y1 ~ x1 - 1)    # the "-1" forces line through origin

Correlation

Often the goal is to estimate a correlation coefficient rather than a regression coefficient. The method for calculating a correlation between independent contrasts differs from the usual calculation for a correlation because we need to account for the fact that the direction of the contrast is arbitrary. Both the following calculations should give the same answer. First calculate x1, y1, and z as in the PICs section above. Then calculate the correlation between independent contrasts as

r <- sum(x1*y1)/(sqrt(sum(x1^2))*sqrt(sum(y1^2)))  # or
r <- sqrt(summary(z)$r.squared)

Zero branch lengths

The PIC method in ape will crash if certain of the branches of the tree have length zero. If this happens to you, inspect the branch lengths.

mytree$edge.length        # lengths of all tree branches
range(mytree$edge.length) # shortest and longest branch

Try adding a very small number to each of the branches to prevent crashes caused by zero branch lengths. Here I add a constant “0.001” to each branch.

mytree$edge.length <-  mytree$edge.length + 0.001

Retry the PICs method to see if this fixes the problem.

General least squares method

General least squares (GLS) is a linear model technique that allows non-independent data points to be fitted when the expected similarity (“correlation”) between data points is known. In the present case, the phylogeny provides the expected correlations, assuming that trait evolution conforms to a Brownian motion or other process. The method gives identical results to PICs with two variables undergoing Brownian motion.

The advantage of GLS instead of PICs is that it provides a more flexible framework for analyzing trait evolution. Any linear model can be corrected for phylogeny. It also allows the use of other evolutionary models, not just Brownian motion. Diagnostics (e.g., residual and scatter plots based on the phylogenetically transformed data) are harder to obtain but we solve this problem below.

Phylogenetic correlation matrix

GLS uses a matrix containing the expected correlation between trait values of all pairs of species. The expected correlation is proportional to the amount of evolutionary history, from root to tips, that they share through common descent. Basically, the phylogenetic correlation matrix re-weights the data from each species in a linear model. The vcv() function calculates the matrix corresponding to a given phylogenetic tree under the Brownian motion model of evolution.

For example, the phylogenetic correlation matrix for the tree of great apes read earlier is as follows (assuming Brownian motion). Other evolutionary models are described below.

v <- vcv(apetree, corr = TRUE)
v

##        Homo Pongo Macaca Ateles Galago
## Homo   1.00  0.79   0.51   0.38      0
## Pongo  0.79  1.00   0.51   0.38      0
## Macaca 0.51  0.51   1.00   0.38      0
## Ateles 0.38  0.38   0.38   1.00      0
## Galago 0.00  0.00   0.00   0.00      1

Using GLS

The gls() command is from the nlme library. The variables x and y are assumed to be in the data frame mytree. We are going back to the data here and are not using the PICs calculated earlier. Since we are modeling the data directly rather than the contrasts, we don’t remove the intercept term from the model formula.

z <- gls(y ~ x, data=mydata, correlation=corBrownian(1, mytree))

The results are stored in the gls() model object z. This object behaves similarly to a lm() model object from ordinary linear model fitting, and a number of the same commands are used to pull out results.

summary(z) # coefficients, SE's, fit statistics
confint(z) # confidence intervals for coefficients

Diagnostic plots for GLS

The GLS method essentially transforms the variables in your linear model to a new scale where all the usual assumptions of linear models – independent residuals having equal variance — are met (assuming that your model of evolution is the correct one). GLS then fits an ordinary linear model to these transformed variables.

You can evaluate linear model assumptions by making scatter plots and residual plots of these transformed variables from a GLS analysis using the lm.gls() function at the end of this page. Cut and paste the code for the function into your R command console. You’ll need to load the visreg package too.

I illustrate using data from Rolland et al (2020) “Vulnerability to fishing and life history traits correlate with the load of deleterious mutations in teleosts”, Molecular Biology and Evolution 37: 2192–2196. The linear model will fit an estimate of deleterious mutation accumulation in fish species to a measure of fish species vulnerability in the face of human exploitation.

# read the tree
fishtree <- read.tree(url("https://www.zoology.ubc.ca/~schluter/R/csv/fishtree.tre"))
fishtree

## 
## Phylogenetic tree with 65 tips and 64 internal nodes.
## 
## Tip labels:
##   Astyanax_mexicanus, Danio_rerio, Gasterosteus_aculeatus, Myoxocephalus_scorpius, Sebastes_norvegicus, Chaenocephalus_aceratus, ...
## 
## Rooted; includes branch lengths.

# read the data
fishdat <- read.csv(url("https://www.zoology.ubc.ca/~schluter/R/csv/fishdat.csv"), 
                  row.names = 1)
head(fishdat)

##                                 dNdS ss.Vulnerability
## Anabas_testudineus        0.08792156            12.47
## Antennarius_striatus      0.07221790            13.16
## Arctogadus_glacialis      0.10974063            31.13
## Astyanax_mexicanus        0.05488084            10.00
## Bathygadus_melanobranchus 0.10815480            54.74
## Benthosema_glaciale       0.07361320            27.28

# Calculate the phylogenetic correlation matrix
v <- vcv(fishtree,  model = "Brownian", corr = TRUE)

# Fit the GLS model using the lm.gls() function
z <- lm.gls(x = fishdat$ss.Vulnerability, y = fishdat$dNdS, v = v)

# The output will include the fitted model object "lm.fit" and a data frame of results
names(z)

## [1] "lm.fit" "result"

Use summary() to extract coefficients from the fitted model object and anova() to test factors.

summary(z$lm.fit)
anova(z$lm.fit)

In results data frame, x is the transformed predictor variable of interest and ynew is the transformed response variable. yhat are the predicted values and resid are the residuals.

head(z$result)

##   intercept         x         ynew        yhat        resid
## 1 0.8287927  10.15330  0.077005090  0.05926235  0.017742739
## 2 0.8287927  11.08866  0.055717328  0.05991080 -0.004193474
## 3 0.1766355   2.66602  0.064105281  0.01297831  0.051126967
## 4 0.1766355 -39.79760 -0.046142947 -0.01646017 -0.029682779
## 5 0.1785635  36.81465  0.053106918  0.03677380  0.016333115
## 6 0.1785635 -11.79493 -0.008038477  0.00307455 -0.011113028

Use visreg to show a scatter plot of the transformed variables.

visreg(z$lm.fit, "x", ylab = "Corrected dN/dS", xlab = "Corrected Vulnerability")

A residual plot is obtained by comparing resid with yhat.

plot(z$result$resid ~ z$result$yhat)

Alternative evolutionary models

Under the strict Brownian motion model, the expected correlation ρ_ij between the trait values of any two species i and j is measured by the fraction of total phylogenetic history, from root to tip, that they share. An alternative possibility is that every correlation is weaker by the same amount, λ*ρ_ij, where λ is a constant between 0 and 1. A λ value less than 1 implies that the effect of phylogeny is not as strong as predicted by Brownian motion, as though each species has experienced an extra bit of independent evolution since it split from its ancestor. If λ = 0 then there is no phylogenetic signal at all. The approach multiplies all the off-diagonal values of the phylogenetic correlation matrix by λ and then fits a GLS model using this modified correlation matrix.

The method is simple to implement in ape. The three commands below fit models to the data in which λ is 1, 0, and 0.5.

z <- gls(y ~ x, data=mydata, correlation=corPagel(1, mytree, fixed = TRUE))
z <- gls(y ~ x, data=mydata, correlation=corPagel(0, mytree, fixed = TRUE))
z <- gls(y ~ x, data=mydata, correlation=corPagel(0.5, mytree, fixed = TRUE))

The following command finds and fits the maximum likelihood value for λ, given the data. The best-fit value of λ is a useful guide to the extent to which species differences are predicted by phylogeny.

z <- gls(y ~ x, data=mydata, correlation=corPagel(1, mytree, fixed = FALSE))

For diagnostic plots (scatter and residual plots) use the lm.gls() as above but use a modified phylogenetic correlation matrix.

v <- vcv(corPagel(1, fishtree, fixed=FALSE)) # or use Fixed = TRUE and supply a lambda value

The Ornstein–Uhlenbeck process

Under Brownian motion, the expected or average difference between species is proportional to the amount of time since they split from a common ancestor. This assumes that species can evolve ever greater differences without constraint. However, real evolution seems to have bounds, even if they are soft. The Ornstein–Uhlenbeck (OU) process attempts to capture this idea of limits or constraints in a modified random walk that includes an “attractor” or “optimum.” (The process is sometimes said to represent a model of “stabilizing selection”, but this is confusing because the attractor isn’t an optimum in the population genetic sense.) Under an OU process, the farther a species trait evolves away from the attractor, the stronger the tendency for the next step in its random walk to be toward the attractor rather than away from it.

Use gls() to fit an OU model to the data as follows.

z <- gls(y ~ x, data=mydata, correlation=corMartins(1, mytree, fixed=TRUE))  # λ = 1
z <- gls(y ~ x, data=mydata, correlation=corMartins(1, mytree, fixed=FALSE)) # λ is estimated

For diagnostic plots (scatter and residual plots) use the lm.gls() as above but use a modified phylogenetic correlation matrix for the OU process.

v <- vcv(corMartins(1, fishtree, fixed=TRUE))

gls.fit() function

To use this function, copy and paste the code into your console.

lm.gls <- function(x, y, v){
    # Function to implement general least squares
    # Assumes no missing values
    
    # x is a univariate vector or a matrix of variables
    # y is a vector, the response variable
    # v is the phylogenetic covariance matrix
    
    # There is a transformation of the original x and y variables
    # that would fulfill the assumptions of uncorrelated residuals having
    # equal variances (see Draper and Smith p 108 for this logic in the
    # case of weighted least squares for uncorrelated data; p156 for
    # the case with serially correlated data). The transformation (P)
    # is obtained by solving P'P=V as follows:
    
    # Generate the design matrix X, including column of 1's for intercept
    z <- lm(y ~ x)
    X <- model.matrix(z)
    colnames(X)[1] <- "intercept"
    
    
    # Calculate pv, the "square root" of v
    ve <- eigen(v)
    pv <- ve$vectors %*% sqrt(diag(ve$values)) %*% t(ve$vectors)
    
    # Transform y and X by pv
    X <- as.data.frame(solve(pv) %*% X)
    ynew <- solve(pv) %*% y
    
    # Fit the gls model, corrected for phylogeny
    f <- as.formula(paste("ynew ~", paste(names(X), collapse = " + "), "-1"))
    lm.fit <- lm(f, data = X)
    
    # Collect results
    yhat <- predict(lm.fit)
    resid <- residuals(lm.fit)  
    result <- data.frame(X, ynew, yhat, resid)
    
    return(list(lm.fit = lm.fit, result = result))
    }