Phylogenetic tools

The ape package in R contains tools of phylogenetic comparative methods. For more details on the package see the online book by E. Paradis, Analysis of phylogenetics and evolution with R. A link to this book can be found on the "textbooks" tab of the course main web pages.

Load the ape library to begin. You may also need to load the nlme library.

library(ape)

Phylogenetic trees

This section explains commands to read and write files containing phylogenetic trees, and to display trees in the plot window in R.

Read tree from file

Phylogenetic trees are usually represented in one of two text formats, newick and nexus. The formats are explained below. To read a tree contained in a text file, use the corresponding command.

mytree <- read.tree(file.choose()) # newick format

mytree <- read.nexus(file.choose()) # nexus format

The object "mytree" is a special type of list. The following commands access some of the information stored in "mytree".

mytree # prints basic tree information

mytree$tip.label # species names

mytree$edge.length # lengths of all tree branches

mytree$edge # identity of branches (from node, to node)

View tree

Use the plot command to view a phylogeny stored in the object "mytree". Include the "cex" option to help reduce overlap of species labels.

plot(mytree,cex=0.7)

Write trees to file

write.tree(file.choose()) # newick format

write.nexus(file.choose()) # nexus format

Tree formats

The newick format is used by most programs that estimate phylogenies. Trees in this format look like the following:

((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,

Ateles:0.62):0.38,Galago:1.00);

Each species at the tips of the phylogeny has a name followed by a number indicating the length of the branch connecting it to its immediate ancestor. Sister taxa -- those sharing an immediate common ancestor -- are wrapped by parentheses. In the above example, the innermost parentheses indicate that Homo and Pongo are sister taxa. The next set of parentheses outward indicate that Macaca is the sister to the Homo/Pongo group. And so on.

The nexus format is also widely used (Felsenstein's "Phylip" programs use only the newick format). The same tree shown above in newick format looks like the following in nexus format:

BEGIN TAXA;

DIMENSIONS NTAX = 5;

TAXLABELS

Homo

Pongo

Macaca

Ateles

Galago

;

END;

BEGIN TREES;

TRANSLATE

1 Homo,

2 Pongo,

3 Macaca,

4 Ateles,

5 Galago

;

TREE * UNTITLED = [&R] ((((1:0.21,2:0.21):0.28,3:0.49):0.13,4:0.62):0.38,5:1);

END;

Trait data

Trait data read from a text file should be stored in a data frame. Use the familiar commands in R to read the data, e.g.,

mydata <- read.csv(file.choose(),stringsAsFactors=FALSE)

Species names

Make sure the species names are included in the data frame. They should match to the letter (and case) the species names in the "mytree" object that stores the tree information (the species names are in mytree$tip.label). So be sure to check carefully.

Row names

Although not crucial, it can be convenient to change the row names of "mydata" to be the species names. For example, if the variable "species" in "mydata" contains the species names, set the rownames as follows

rownames(mydata) <- mydata$species

Row order

The species data in the data frame containing the trait measurements must be in the same row order as the names of the species in the tree object, "mytree" (species names are in mytree$tip.label). If this is not the case then you need to rearrange the rows of the data frame to match the order of species names in the tree object. The following command will do this for you; it assumes that the rownames of "mydata" are the species names:

mydata <- mydata[match(mytree$tip.label,rownames(mydata)),]

Phylogenetically independent contrasts (PIC)

This is a method that corrects for the nonindependence of data points resulting from phylogeny. It assumes that trait evolution mimics a Brownian motion process with unchanging rates through time and along all branches of the tree. This is a difficult assumption to verify.

Before beginning, make sure that the rows of the species data in "mydata" are in the same order as the names of the species in the tree object, "mytree", namely mytree$tip.label. If not, see the previous section to rearrange.

PICs

The method transforms the two variables x and y separately into phylogenetically independent contrasts, x1 and y1. Then fit a linear model using these contrasts but leave out the intercept term so that the fitted line passes through the origin. The regression is though the origin because each contrast is calculated as a difference (e.g., species A minus species B) and the direction of the difference is arbitrary (it could just as well have been calculated as species B minus species A).

x1 <- pic(mydata$x, mytree)

y1 <- pic(mydata$y, mytree)

View x1 and y1 before entering the next command. Check especially for values listed as "NaN" or "Inf". They may result if some of the tip branches in the tree have length 0 (the branch lengths are given in mytree$edge.length). See below for suggestions on what to do in this case.

If all is well, then fit the linear model through the origin. All the usual commands can be used to pull results from the fitted lm object, z.

z <- lm(y1 ~ x1 - 1) # the "-1" forces line through origin

Correlation

Often the goal is to estimate a correlation coefficient rather than a regression coefficient. The method for calculating a correlation between independent contrasts differs from the usual calculation for a correlation because we need to account for the fact that the direction of the contrast is arbitrary. Both the following calculations should give the same answer. First calculate x1, y1, and z as in the PICs section above. Then calculate the correlation between independent constrasts as

r <- sum(x1y1)/(sqrt(sum(x1^2))sqrt(sum(y1^2))) # or

r <- sqrt(summary(z)$r.squared)

Zero branch lengths

The PIC method in "ape" will crash if certain of the branches of the tree have length zero. If this happens to you, inspect the branch lengths.

mytree$edge.length # lengths of all tree branches

range(mytree$edge.length) # shortest and longest branch

Try adding a very small number to each of the branches to prevent crashes caused by zero branch lengths. Here I add a constant "0.001" to each branch.

mytree$edge.length <- mytree$edge.length + 0.001

Retry the PICs method to see if this fixes the problem.

General least squares method

General least squares (GLS) is a linear model technique that allows non-independent data points to be fitted when the expected similarity ("correlation") between data points is known. In the present case, phylogeny provides the expected correlations, assuming that trait evolution conforms to a Brownian motion process. The basic method gives identical results to PICs. The advantage of analyzing the data using GLS instead of PICs is that it provides a more flexible framework for analyzing trait evolution.

Phylogenetic correlation matrix

GLS uses a matrix containing the expected correlations between the trait values of all pairs of species. The expected correlation between any pair is the proportion of evolutionary history, from root to tips, that they share through common descent. The phylogenetic correlation matrix is used to weight the data from each species in a linear model. You can view the matrix corresponding to a given phylogenetic tree using the following command:

vcv.phylo(mytree, cor=TRUE)

Here is the phylogenetic correlation matrix corresponding to the example tree above. The diagonal elements of the correlation matrix are "1". The off-diagonal elements give the fraction of the total tree, from root to tip, that is shared between a given pair of species i and j. For example, Homo amd Pongo share a great deal of their histories (ρ = 0.79), prior to branching apart. At the other extreme, Galago shares none of its history with any other species in this tree (ρ = 0).

	Homo	Pongo	Macaca	Ateles	Galago
Homo	1	0.79	0.51	0.38	0
Pongo	0.79	1	0.51	0.38	0
Macaca	0.51	0.51	1	0.38	0
Ateles	0.38	0.38	0.38	1	0
Galago	0	0	0	0	1

If the off-diagonal elements of the correlation matrix are all zero, then this indicates a "star phylogeny" and the results of GLS will be the same as those from an ordinary linear model that ignores phylogeny.

Using GLS

The gls command is from the nlme library. The variables x and y are assumed to be in the data frame "mydata". We are going back to the data here and are not using the PICs calculated earlier. Since we are modeling the data directly rather than the contrasts, we don't remove the intercept term from the model formula.

z <- gls(y ~ x, data=mydata, correlation=corBrownian(1,mytree))

The results are stored in the gls model object z. This object behaves similarly to a lm model object from ordinary linear model fitting, and a number of the same commands are used to pull out results.

plot(z) # residual plot

summary(z) # coefficients, SE's, fit statistics
confint(z) # confidence intervals for coefficients

Pagel's λ

The GLS approach makes it easy to ask whether it is even necessary to take phylogeny into account when fitting a linear model to species data. Pagel's λ ("lambda") is one way to evaluate the importance of phylogeny to species data. The method is founded on the assumption of a modified Brownian motion process.

Under the strict Brownian motion model, the expected correlation ρ_ij between the trait values of any two species i and j is measured by the fraction of total phylogenetic history, from root to tip, that they share. An alternative possibility is that every correlation is weaker by the same amount, λ*ρ_ij, where λ is a constant between 0 and 1. A λ value less than 1 implies that the effect of phylogeny is not as strong as that expected under a strict Brownian process, as though each species has experienced an extra bit of independent evolution since it split from its ancestor. If λ = 0 then there is no effect of phylogeny at all. λ is fitted by multiplying all the off-diagonal values of the phylogenetic correlation matrix by λ and then fitting a GLS model with this modified correlation matrix.

The method is simple to implement in ape. The three commands below fit models to the data in which λ is 1, 0, and 0.5.

z <- gls(y ~ x, data=mydata, correlation=corPagel(1,mytree,

fixed = TRUE))

z <- gls(y ~ x, data=mydata, correlation=corPagel(0,mytree,

fixed = TRUE))

z <- gls(y ~ x, data=mydata, correlation=corPagel(0.5,mytree,

fixed = TRUE))

The following command finds and fits the maximum likelihood value for λ, given the data. The best-fit value of λ is a useful guide to the extent to which species differences are predicted by phylogeny.

Phylogenetic tools

library(ape)

Phylogenetic trees

Read tree from file

mytree <- read.tree(file.choose()) # newick format

mytree <- read.nexus(file.choose()) # nexus format

mytree # prints basic tree information

mytree$tip.label # species names

mytree$edge.length # lengths of all tree branches

mytree$edge # identity of branches (from node, to node)

View tree

plot(mytree,cex=0.7)

Write trees to file

write.tree(file.choose()) # newick format

write.nexus(file.choose()) # nexus format

Tree formats

((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,

Ateles:0.62):0.38,Galago:1.00);

BEGIN TAXA;

DIMENSIONS NTAX = 5;

TAXLABELS

Homo

Pongo

Macaca

Ateles

Galago

;

END;

BEGIN TREES;

TRANSLATE

1 Homo,

2 Pongo,

3 Macaca,

4 Ateles,

5 Galago

;

TREE * UNTITLED = [&R] ((((1:0.21,2:0.21):0.28,3:0.49):0.13,4:0.62):0.38,5:1);

END;

Trait data

mydata <- read.csv(file.choose(),stringsAsFactors=FALSE)

Species names

Row names

rownames(mydata) <- mydata$species

Row order

mydata <- mydata[match(mytree$tip.label,rownames(mydata)),]

Phylogenetically independent contrasts (PIC)

PICs

x1 <- pic(mydata$x, mytree)

y1 <- pic(mydata$y, mytree)

z <- lm(y1 ~ x1 - 1) # the "-1" forces line through origin

Correlation

r <- sum(x1*y1)/(sqrt(sum(x1^2))*sqrt(sum(y1^2))) # or

r <- sqrt(summary(z)$r.squared)

Zero branch lengths

mytree$edge.length # lengths of all tree branches

range(mytree$edge.length) # shortest and longest branch

mytree$edge.length <- mytree$edge.length + 0.001

General least squares method

Phylogenetic correlation matrix

vcv.phylo(mytree, cor=TRUE)

Using GLS

z <- gls(y ~ x, data=mydata, correlation=corBrownian(1,mytree))

plot(z) # residual plot

summary(z) # coefficients, SE's, fit statistics confint(z) # confidence intervals for coefficients

Pagel's λ

z <- gls(y ~ x, data=mydata, correlation=corPagel(1,mytree,

fixed = TRUE))

z <- gls(y ~ x, data=mydata, correlation=corPagel(0,mytree,

fixed = TRUE))

z <- gls(y ~ x, data=mydata, correlation=corPagel(0.5,mytree,

fixed = TRUE))

z <- gls(y ~ x, data=mydata, correlation=corPagel(1,mytree,

fixed = FALSE))

r <- sum(x1y1)/(sqrt(sum(x1^2))sqrt(sum(y1^2))) # or

summary(z) # coefficients, SE's, fit statistics
confint(z) # confidence intervals for coefficients