In this workshop will explore some of the tools available in R for analyzing species as data points. For help with R commands to carry out resampling, see the phylogeny tab on the R tips help pages.
Rockfish (genus Sebastes) are a hugely diverse group of mainly Pacific Ocean fishes. There are over 100 recognized species inhabiting everything from the intertidal zone to waters over 2000 m deep. They have some of the highest longevities of any fish, with a maximum reported age of 205 years. Data on maximum body size (length), age (lifespan) and habitat depth of 56 species is provided here in a .csv file. The data were gathered from FishBase, from Love (2002), and by Travis Ingram.
A phylogenetic tree of the 56 species is provided in a text file here. The tree, from Hyde & Vetter (2007), is a consensus Bayesian tree based on 7 mitochondrial and 2 nuclear genes.
Download the file containing the phylogenetic tree. Inspect to determine whether it is in newick or nexus format and read into R using the appropriate command.
Plot commands in ape
will change the graphical
parameters from the defaults. Save a copy of the default values by
executing the following command: old.par <- par()
. We’ll
make use of old.par
later.
Plot the phylogeny. You may have to adjust an option to minimize overlapping the labels. Take a moment to admire the structure of the tree. Branch lengths are intended to reflect time. Does it look like the genus diversified mainly in a sudden early burst, a recent explosion, or a steady growth in the number of species through time? Notice that all the tips are contemporaneous.
Obtain the species measurements from the data file and input to a data frame. Check that the species names in the trait file and in the phylogenetic tree are identical and are listed in the same order (this is a requirement for the methods we will be using from the ape package).
Which of the traits show strong phylogenetic signal? First, use a method that places dots at the tips of the tree, with dot size reflecting the value for the trait. Is there a tendency for closely related species to be more similar in their trait values than more distantly related species?
Next, calculate Pagel’s \(\lambda\) for each trait. Which traits show strong phylogenetic signal, according to this metric?
library(ape)
library(phytools)
# 1. Download tree file
rtree <- read.tree(url("https://www.zoology.ubc.ca/~bio501/R/data/rockfish.phy"))
rtree
##
## Phylogenetic tree with 56 tips and 55 internal nodes.
##
## Tip labels:
## Sebastes.reedi, Sebastes.crameri, Sebastes.ciliatus, Sebastes.polyspinis, Sebastes.alutus, Sebastes.aurora, ...
##
## Rooted; includes branch lengths.
# 2. Save default graphical parameters
old.par <- par()
# 3. Plot phylogeny
plot(rtree, cex = 0.6)
# 4. Input species measurements
rockfish <- read.csv(url("https://www.zoology.ubc.ca/~bio501/R/data/rockfish.csv"),
stringsAsFactors = FALSE)
rownames(rockfish) <- rockfish$species
rockfish <- rockfish[ match(rtree$tip.label, rownames(rockfish)), ]
# 5. Plot traits
x <- rockfish[, c("maxlength.cm", "maxage.y", "maxdepth.m")]
dotTree(rtree, as.matrix(x)[,c("maxlength.cm")], cex = 0.5)
dotTree(rtree, as.matrix(x)[,c("maxage.y")], cex = 0.5)
dotTree(rtree, as.matrix(x)[,c("maxdepth.m")], cex = 0.5)
# 6. Pagel's lambda
phylosig(rtree, as.matrix(x)[,c("maxlength.cm")], method = "lambda")
##
## Phylogenetic signal lambda : 0.999927
## logL(lambda) : -236.493
phylosig(rtree, as.matrix(x)[,c("maxage.y")], method = "lambda")
##
## Phylogenetic signal lambda : 0.95213
## logL(lambda) : -275.213
phylosig(rtree, as.matrix(x)[,c("maxdepth.m")], method = "lambda")
##
## Phylogenetic signal lambda : 0.420141
## logL(lambda) : -387.161
Let’s begin with an analysis that ignores phylogeny, so that we have a baseline for comparison.
Restore the default graphical parameters by executing
par(old.par)
. It is not a huge issue if you skip this step
(give it a try and see what happens).
Inspect scatter plots of the species data. Make any necessary transformations here to help meet the assumptions of linear models*.
Choose one pair of variables in the data set to compare (suitably transformed). Carry out a linear model fit ignoring phylogeny for now.
Examine the diagnostic plots in (3) to check your assumptions.
Extract the regression coefficients and standard errors from the model object created in (3). Take note of the values for the slope estimate.
Obtain the correlation coefficient between the variables used in (3). Take note of the value of the estimate.
* Log-transforming the traits seems to be a good idea.
# Restore graphical parameters
par(old.par)
# 2. Inspect and transform
pairs(rockfish[, c("maxlength.cm", "maxage.y", "maxdepth.m")])
pairs(log(rockfish[, c("maxlength.cm", "maxage.y", "maxdepth.m")]))
rockfish$log.maxlength <- log(rockfish$maxlength.cm)
rockfish$log.maxage <- log(rockfish$maxage.y)
rockfish$log.maxdepth <- log(rockfish$maxdepth.m)
# 3. Linear model
z <- lm(log.maxage ~ log.maxlength, data = rockfish)
# 4. Diagnostic plots
plot(z)
# 5. Estimate
summary(z)
##
## Call:
## lm(formula = log.maxage ~ log.maxlength, data = rockfish)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1098 -0.3995 0.1121 0.4347 0.8134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2916 0.7101 -0.411 0.683
## log.maxlength 1.0495 0.1809 5.801 3.54e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5226 on 54 degrees of freedom
## Multiple R-squared: 0.384, Adjusted R-squared: 0.3725
## F-statistic: 33.66 on 1 and 54 DF, p-value: 3.543e-07
# 6. Correlation
cor(rockfish$log.maxage, rockfish$log.maxlength)
## [1] 0.6196418
Let’s use the same variables and apply phylogenetically independent contrasts (PICs) instead.
Convert the same two variables used in your TIPS analysis to phylogenetically independent contrasts. Create a scatter plot of the contrasts in your two variables. Are they associated?
Fit a linear model to the independent contrasts you created in (1). Use the contrasts corresponding to the same response and explanatory variables as in your TIPS analysis. Examine the diagnostic plots to check the linear model assumptions.
Extract the regression coefficients and standard errors from the model object created in (2). How does the slope* compare with that obtained in your TIPS analysis? Is the standard error from the PICs analysis greater than, less than, or the same as in your TIPS analysis? (Meta-analyses have often found that PICs yield a similar answer to an analysis that ignores phylogeny, but your specific case might or might not.)
Calculate the correlation coefficient** on the independent contrasts (consult the R tips pages if necessary).
* 1.19 when explanatory variable is log(maxlength.cm) and
response variable is log(maxage.y)
** 0.625
# 1. Contrasts
contrast.log.maxlength <- pic(rockfish$log.maxlength, rtree)
contrast.log.maxage <- pic(rockfish$log.maxage, rtree)
plot(contrast.log.maxage ~ contrast.log.maxlength)
# 2. Linear model independent contrasts
z1 <- lm(contrast.log.maxage ~ contrast.log.maxlength - 1)
plot(z1)
# 3. Estimate
summary(z1)
##
## Call:
## lm(formula = contrast.log.maxage ~ contrast.log.maxlength - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7308 -0.2192 -0.0168 0.2147 0.6294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## contrast.log.maxlength 1.1913 0.2025 5.884 2.61e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2918 on 54 degrees of freedom
## Multiple R-squared: 0.3907, Adjusted R-squared: 0.3794
## F-statistic: 34.62 on 1 and 54 DF, p-value: 2.614e-07
# 4. Correlation
sum(contrast.log.maxlength*contrast.log.maxage) /
(sqrt(sum(contrast.log.maxlength^2))*sqrt(sum(contrast.log.maxage^2)))
## [1] 0.6250418
Carry out the equivalent analysis to PICs using general least squares instead. Confirm that the slope coefficient and its standard error are identical to that obtained in the analysis of independent contrasts.
Examine the residual plot from the model fit in (1). Notice that the mean of the residuals is not zero (calculate the mean of the residuals to confirm this). This is because the GLS analysis does not weight each of the observations equally.
(In effect, GLS fits a linear model to the variables after transforming them according to values computed from the phylogenetic correlation matrix. A disadvantage of the residual plot here is that we aren’t seeing the diagnostic plot for the transformed variables, which would be nice. These can be calculated “by hand” but it is no fun*.)
* Actually, it is quite fun.
library(nlme)
# 1. GLS
z2 <- gls(log.maxage ~ log.maxlength, data= rockfish, correlation=corBrownian(1, rtree))
summary(z2)
## Generalized least squares fit by REML
## Model: log.maxage ~ log.maxlength
## Data: rockfish
## AIC BIC logLik
## 97.16098 103.1279 -45.58049
##
## Correlation Structure: corBrownian
## Formula: ~1
## Parameter estimate(s):
## numeric(0)
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -0.6998314 0.8811189 -0.794253 0.4305
## log.maxlength 1.1912722 0.2024551 5.884131 0.0000
##
## Correlation:
## (Intr)
## log.maxlength -0.945
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.57871076 -0.61756653 -0.05523672 0.29014355 0.75693801
##
## Residual standard error: 0.8205782
## Degrees of freedom: 56 total; 54 residual
# 2. Residual plot
plot(z2)
Does incorporating phylogeny result in a better fit to the data than ignoring phylogeny? Use Pagel’s \(\lambda\) to help decide this. Using GLS, fit your model again while fixing Pagel’s \(\lambda\) = 1. Refit, but this time using \(\lambda\) = 0. Which lambda fits best? How large is the difference in their AIC scores?
Find the maximum-likelihood estimate of \(\lambda\)*. Is the AIC score improved when a linear model is fitted using this maximum likelihood value for \(\lambda\) ?
Repeat the analyses with the other pairs of variables to decide whether including phylogeny generally improves the fit of linear models to these data.
*0.583 This value is lower than the largest value you calculated earlier on one of the traits. This is likely because \(\lambda\) is now a compromise between the phylogenetic signals in two traits evolving together.
library(ape)
# 1. Pagel's lambda
z3 <- gls(log.maxage ~ log.maxlength, data= rockfish,
correlation=corPagel(1,rtree, fixed = TRUE))
z4 <- gls(log.maxage ~ log.maxlength, data= rockfish,
correlation=corPagel(0,rtree, fixed = TRUE))
AIC(z3, z4)
## df AIC
## z3 3 97.16098
## z4 3 95.30852
# 2. ML estimate of lambda
z5 <- gls(log.maxage ~ log.maxlength, data= rockfish,
correlation=corPagel(1,rtree, fixed = FALSE))
summary(z5)
## Generalized least squares fit by REML
## Model: log.maxage ~ log.maxlength
## Data: rockfish
## AIC BIC logLik
## 93.04107 100.997 -42.52054
##
## Correlation Structure: corPagel
## Formula: ~1
## Parameter estimate(s):
## lambda
## 0.5834147
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -0.3687913 0.7846381 -0.470015 0.6402
## log.maxlength 1.1005782 0.1892718 5.814803 0.0000
##
## Correlation:
## (Intr)
## log.maxlength -0.976
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.1603557245 -0.8533628093 0.0002310064 0.4942494863 1.1386086123
##
## Residual standard error: 0.5769972
## Degrees of freedom: 56 total; 54 residual
AIC(z5)
## [1] 93.04107
# 3. Same for other pairs of variables
z6 <- gls(log.maxdepth ~ log.maxlength, data= rockfish,
correlation=corPagel(1,rtree, fixed = TRUE))
z7 <- gls(log.maxdepth ~ log.maxlength, data= rockfish,
correlation=corPagel(0,rtree, fixed = TRUE))
z8 <- gls(log.maxdepth ~ log.maxlength, data= rockfish,
correlation=corPagel(1,rtree, fixed = FALSE))
summary(z8)
## Generalized least squares fit by REML
## Model: log.maxdepth ~ log.maxlength
## Data: rockfish
## AIC BIC logLik
## 119.9422 127.8981 -55.9711
##
## Correlation Structure: corPagel
## Formula: ~1
## Parameter estimate(s):
## lambda
## 0.5561114
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 4.975487 1.001192 4.969563 0.0000
## log.maxlength 0.293817 0.242039 1.213923 0.2301
##
## Correlation:
## (Intr)
## log.maxlength -0.977
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -3.3321810 -0.6266323 -0.1840538 0.2067991 1.1544947
##
## Residual standard error: 0.7325487
## Degrees of freedom: 56 total; 54 residual
AIC(z6, z7, z8)
## df AIC
## z6 3 133.3345
## z7 3 132.5777
## z8 4 119.9422
© 2009-2024 Dolph Schluter